|
View:
New views
8 Messages
—
Rating Filter:
Alert me
|
|
|
scripting / data problem.I have a data file problem that I hope I can get an assist on. Not an
"Ubuntu" question really, but this is about the smartest group I know to ask! Note: I'm not asking anyone to do my homework. In fact, I'd be happy with the RTFM answer, as long as you could tell me which FM and why you say so. Unfortunately, it takes a bit to explain the problem, so this is kind of long. Here is the issue: I have a number of different, partially overlapping in the form of (2) MS Outlook contact databases, (1) MS Access database, (1) quickbooks and (1) SQL db. Though we don't have access the SQL available yet, we currently think that CSV files are the common denominator that everything can import and export. The problem is that the various lists don't all have the same information so I can't just cat them together and sort with a "unique" operator. That's a vague statement. Here is what I mean by file and field: file 1 - | email | file 2 - | email | f-name | l-name | file 3 - | f-name | l-name | company | phone | file 4 - | company | address | city | state | (By the way, each one has some duplicates with less than complete info.) What we need is to be able to populate all of these files with all the available data for each contact. The concept I have in mind is to compare the files in pairs. So file 1 would go into file-L (for Last) Then file 1 would be compared with each line of file-L and if there is a record with matching email, then any extra info is added , but not overwriting anything that is already in any field So the final file would look like this: | file-1 | file-2 | file-3 | file-4 | email | f-name | l-name | company | address | city | state | phone | The first four fields would have an "x" or something that could be subsequently used to parse them back to the originals. I have thought about scripting something that would, "for each line in file 1, use the fields in the line for variables, feed them to "awk" which tests file 2 and returns results to the script to be added to file-L" I have also thought that awk could do the whole thing, but I'd hate to plow through the entire user guide if someone told me it was hopeless. I also wondered whether I could create a mysql database with this info, and there are commands that can merge records (rather than just eliminate what looks like a duplicate). Or perhaps there is another strategy??? Obviously I have a lot of learning to do, but I am wondering about what? Thanks very much for all of you who have read this far. I'd appreciate any insight you may have into the best approach. --PE -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
|
|
Re: scripting / data problem.On Sat, Nov 07, 2009 at 05:31:25PM -0800, Patton Echols wrote:
> file 1 - | email | > file 2 - | email | f-name | l-name | > file 3 - | f-name | l-name | company | phone | > file 4 - | company | address | city | state | > > (By the way, each one has some duplicates with less than complete info.) > > What we need is to be able to populate all of these files with all the > available data for each contact. The concept I have in mind is to > compare the files in pairs. So file 1 would go into file-L (for Last) > Then file 1 would be compared with each line of file-L and if there is a > record with matching email, then any extra info is added , but not > overwriting anything that is already in any field So the final file > would look like this: > > | file-1 | file-2 | file-3 | file-4 | email | f-name | l-name | company > | address | city | state | phone | > [...] > I also wondered whether I could create a mysql database with this info, > and there are commands that can merge records (rather than just > eliminate what looks like a duplicate). I'll vote for the mysql approach. Create a csv for each file, import that into mysql so you have 4 tables. Then create a query that will dump out the data the way you want. You can dump that into another table or export it. For this to work, the fields will have to match in each table. So Comany in file3 is equivalent to Comany in file4 (not something like Acme, Inc vs Acme, Incorporated). SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a, file2 b, file3 c, file4 d WHERE a.email = b.email and concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company = d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested). Data manipulation is much more flexible once its in a database. At least when you have something this squirrelly. -- Hal -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
|
|
Re: scripting / data problem.Thanks for the reply.
On 11/07/2009 06:26 PM, Hal Burgiss wrote: > On Sat, Nov 07, 2009 at 05:31:25PM -0800, Patton Echols wrote: > > >> file 1 - | email | >> file 2 - | email | f-name | l-name | >> file 3 - | f-name | l-name | company | phone | >> file 4 - | company | address | city | state | >> >> (By the way, each one has some duplicates with less than complete info.) >> >> What we need is to be able to populate all of these files with all the >> available data for each contact. The concept I have in mind is to >> compare the files in pairs. So file 1 would go into file-L (for Last) >> Then file 1 would be compared with each line of file-L and if there is a >> record with matching email, then any extra info is added , but not >> overwriting anything that is already in any field So the final file >> would look like this: >> >> | file-1 | file-2 | file-3 | file-4 | email | f-name | l-name | company >> | address | city | state | phone | >> >> > > [...] > > >> I also wondered whether I could create a mysql database with this info, >> and there are commands that can merge records (rather than just >> eliminate what looks like a duplicate). >> > > I'll vote for the mysql approach. Create a csv for each file, import that into > mysql so you have 4 tables. Then create a query that will dump out the data > the way you want. You can dump that into another table or export it. For this > to work, the fields will have to match in each table. So Comany in file3 is > equivalent to Comany in file4 (not something like Acme, Inc vs Acme, > Incorporated). > Do you mean that all the field data must match? Or just, for example, the email fields? Because part of the problem is that I have non matching / missing data depending on file. > SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a, > file2 b, file3 c, file4 d WHERE a.email = b.email and > concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company > = d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested). > > Data manipulation is much more flexible once its in a database. At least when > you have something this squirrelly. > > Does the concat commands mean that there would no longer be a separate first name and last name? Also, it looks like the "=" is a requirement. Is that what you meant by the fields having to be the same? Or could I arrange for it to have file 1 overwrite file 2 for certain fields. Or only blank fields? -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
|
|
Re: scripting / data problem.On Sat, Nov 7, 2009 at 7:31 PM, Patton Echols <p.echols@...> wrote:
> The problem is that the various lists don't all have the same > information so I can't just cat them together and sort with a "unique" > operator. That's a vague statement. Here is what I mean by file and field: > I assume you're going to need to get this data back out and into the original applications, eventually. The basic strategy I'd take is to import everything into seperate tables in mysql. Create additional tables that you will export out of. Massage the data from the import tables into your output tables, merging and correcting as you can (this may be really easy or really hard depending on how clean your data is). Chances are you're going to need some cheap labor to clean the data up (got interns?) depending on the amount of data. I'd also suggest adding a unique ID to each record so that if you have to do this merge again, this will be a bit easier to handle. -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
|
|
Re: scripting / data problem.On Sun, Nov 08, 2009 at 11:41:34PM -0800, Patton Echols wrote:
> > > >> file 1 - | email | > >> file 2 - | email | f-name | l-name | > >> file 3 - | f-name | l-name | company | phone | > >> file 4 - | company | address | city | state | > >> > >> (By the way, each one has some duplicates with less than complete info.) > >> > >> What we need is to be able to populate all of these files with all the > >> available data for each contact. The concept I have in mind is to > >> compare the files in pairs. So file 1 would go into file-L (for Last) > >> Then file 1 would be compared with each line of file-L and if there is a > >> record with matching email, then any extra info is added , but not > >> overwriting anything that is already in any field So the final file > >> would look like this: [...] > > I'll vote for the mysql approach. Create a csv for each file, import that into > > mysql so you have 4 tables. Then create a query that will dump out the data > > the way you want. You can dump that into another table or export it. For this > > to work, the fields will have to match in each table. So Comany in file3 is > > equivalent to Comany in file4 (not something like Acme, Inc vs Acme, > > Incorporated). > > > > Do you mean that all the field data must match? Or just, for example, > the email fields? Because part of the problem is that I have non > matching / missing data depending on file. In order for the relations to work across all 4 files, the fields that are being keyed on, *should* match exactly. You can get around the missing data by using a 'left join' syntax. > > SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a, > > file2 b, file3 c, file4 d WHERE a.email = b.email and > > concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company > > = d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested). > > > > Data manipulation is much more flexible once its in a database. At least when > > you have something this squirrelly. > Does the concat commands mean that there would no longer be a separate > first name and last name? No, that's just part of the query that contstructs a relationship across all four tables. The output part is the 'b.firstname'. The concat thing is just one way to make sure the columns you are trying to match, actually do match. This will be problematic if one table has 'John Doe' and the other has 'Johnathon Q. Doe, III' type of stuff. > > Also, it looks like the "=" is a requirement. Is that what you meant by > the fields having to be the same? Or could I arrange for it to have > file 1 overwrite file 2 for certain fields. Or only blank fields? An "equivalency" is a lot cleaner for the query part anyway (where you are matching data in one table against another). But you can do anything you want with the output part. So, instead of using b.firstname, you could use c.firstname instead. They would both be in their respective tables (a, b, c, and d). -- Hal -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
|
|
Re: scripting / data problem.On 11/09/2009 03:31 AM, Hal Burgiss wrote:
> On Sun, Nov 08, 2009 at 11:41:34PM -0800, Patton Echols wrote: > >>> >>> >>>> file 1 - | email | >>>> file 2 - | email | f-name | l-name | >>>> file 3 - | f-name | l-name | company | phone | >>>> file 4 - | company | address | city | state | >>>> >>>> (By the way, each one has some duplicates with less than complete info.) >>>> >>>> What we need is to be able to populate all of these files with all the >>>> available data for each contact. The concept I have in mind is to >>>> compare the files in pairs. So file 1 would go into file-L (for Last) >>>> Then file 1 would be compared with each line of file-L and if there is a >>>> record with matching email, then any extra info is added , but not >>>> overwriting anything that is already in any field So the final file >>>> would look like this: >>>> > > [...] > > >>> I'll vote for the mysql approach. Create a csv for each file, import that into >>> mysql so you have 4 tables. Then create a query that will dump out the data >>> the way you want. You can dump that into another table or export it. For this >>> to work, the fields will have to match in each table. So Comany in file3 is >>> equivalent to Comany in file4 (not something like Acme, Inc vs Acme, >>> Incorporated). >>> >>> >> Do you mean that all the field data must match? Or just, for example, >> the email fields? Because part of the problem is that I have non >> matching / missing data depending on file. >> > > In order for the relations to work across all 4 files, the fields that are > being keyed on, *should* match exactly. You can get around the missing data by > using a 'left join' syntax. > Ok, I'll read up on that syntax. Since the point of the exercise is to merge records where there is overlap AND no conflicting data. > > >>> SELECT a.email, b.firstname, b.lastname, c.company, d.address FROM file1 a, >>> file2 b, file3 c, file4 d WHERE a.email = b.email and >>> concat(b.firstname,b.lastname) = concat(c.firstname,c.lastname) and c.company >>> = d.company INTO OUTFILE '/tmp/output.txt' (crude and not tested). >>> >>> Data manipulation is much more flexible once its in a database. At least when >>> you have something this squirrelly. >>> > > >> Does the concat commands mean that there would no longer be a separate >> first name and last name? >> > > No, that's just part of the query that contstructs a relationship across all > four tables. The output part is the 'b.firstname'. The concat thing is just > one way to make sure the columns you are trying to match, actually do match. > This will be problematic if one table has 'John Doe' and the other has > 'Johnathon Q. Doe, III' type of stuff. > My current sense of the data is that the missing info is the basic problem, not conflicting info. What may need to happen is that we merge the missing info and leave "duplicate" records to be dealt with by hand where "John" has his name spelled different ways. > > >> Also, it looks like the "=" is a requirement. Is that what you meant by >> the fields having to be the same? Or could I arrange for it to have >> file 1 overwrite file 2 for certain fields. Or only blank fields? >> > > An "equivalency" is a lot cleaner for the query part anyway (where you are > matching data in one table against another). But you can do anything you want > with the output part. So, instead of using b.firstname, you could use > c.firstname instead. They would both be in their respective tables (a, b, c, > and d). > > > Ok, so the output is the part between the SELECT and FROM statements? So much to learn, so little time! (kidding, this is really helpful) Thanks. -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
|
|
Re: scripting / data problem.On 11/09/2009 12:08 AM, Justin Gruenberg wrote:
> On Sat, Nov 7, 2009 at 7:31 PM, Patton Echols <p.echols@...> wrote: > >> The problem is that the various lists don't all have the same >> information so I can't just cat them together and sort with a "unique" >> operator. That's a vague statement. Here is what I mean by file and field: >> >> > > I assume you're going to need to get this data back out and into the > original applications, eventually. > > The basic strategy I'd take is to import everything into seperate > tables in mysql. Create additional tables that you will export out > of. Massage the data from the import tables into your output tables, > merging and correcting as you can (this may be really easy or really > hard depending on how clean your data is). Chances are you're going > to need some cheap labor to clean the data up (got interns?) depending > on the amount of data. > To "massage" would you use the same basic approach that Hal did? Or was there some other way? Note: Where there are blank fields, then a full one would always "win". Where there is conflicting data, I'd probably want separate records to hand massage. but the owners of all this data may tell me that one of the tables is of sufficiently poor quality that it gets overwritten unless there is no better answer. Not sure about that yet. Interns? That'd be great, but I'm already the volunteer! (And while the computer guru in comparison, obviously no great database expert. </understatement> > I'd also suggest adding a unique ID to each record so that if you have > to do this merge again, this will be a bit easier to handle. > > Yeah, I don't want to do this again. The goal is to clean it up so that it can be used and (pray) kept clean until the organization can migrate everything to a more "professional" solution. Thanks for the reply. --PE -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
|
|
Re: scripting / data problem.On Mon, Nov 09, 2009 at 01:13:54PM -0800, Patton Echols wrote:
> Ok, so the output is the part between the SELECT and FROM statements? Yes, you 'select' which columns from which tables. As long as you build your relationships (the stuff being done with 'where) correctly the data should line up correctly and you can output it however you like, eg to a tab delimited file that can be read easily by a spreadsheet program. -- Hal -- ubuntu-users mailing list ubuntu-users@... Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users |
| Free embeddable forum powered by Nabble | Forum Help |