python compare and process a csv file

View: New views
3 Messages — Rating Filter:   Alert me  

python compare and process a csv file

by Siva Subramanian-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Some parts of this message have been removed. Learn more about Nabble's security policy.

Hello all,

 

I am new on this list and computer programming

 

I have two distinct statistical files (both csv)

1.       Report_2_5 – this is a report dump containing over a 10 million records and is different every day

2.       Customer_id dump – this is a daily dump of customers who have made payments. This is generally a million record


I need to extract past history depending on customers who make regular payments


For example,

Report_2_5

 

Customer ID, Plan_NO, stat1, vol2, amount3
2134, Ins1, 10000, 20000, 10
2112, Ins3, 30000, 20000, 10
2121, Ins3, 30000, 20000, 10
2145, Ins2, 15000, 10000, 5
2245, Ins2, 15000, 10000, 5
0987, Ins1, 10000, 20000, 10

4546, Ins1, 10020, 21000, 10

6757, Ins1, 10200, 22000, 10


customer_id dump

 

0987

4546

6757

2134


I need to process the Report_2_5 and extract the following output


Stat1: 40220

Vol2 : 83000

Amount3 : 40

 

I am new to programming and I have been extracting this data using MS – Access and I badly need a better solution.

 

Will really appreciate any sample code in python.

 

Thanks in advance

Siva



--
http://mail.python.org/mailman/listinfo/python-list

Re: python compare and process a csv file

by Chris Rebert-6 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Nov 3, 2009 at 7:43 AM, Siva Subramanian
<elpostssv@...> wrote:

> Hello all,
>
> I am new on this list and computer programming
>
> I have two distinct statistical files (both csv)
>
> 1.       Report_2_5 – this is a report dump containing over a 10 million records and is different every day
>
> 2.       Customer_id dump – this is a daily dump of customers who have made payments. This is generally a million record
>
> I need to extract past history depending on customers who make regular payments
>
> For example,
>
> Report_2_5
>
> Customer ID, Plan_NO, stat1, vol2, amount3
> 2134, Ins1, 10000, 20000, 10
> 2112, Ins3, 30000, 20000, 10
> 2121, Ins3, 30000, 20000, 10
> 2145, Ins2, 15000, 10000, 5
> 2245, Ins2, 15000, 10000, 5
> 0987, Ins1, 10000, 20000, 10
>
> 4546, Ins1, 10020, 21000, 10
>
> 6757, Ins1, 10200, 22000, 10
> …
>
> customer_id dump
>
>
> 0987
>
> 4546
>
> 6757
>
> 2134
>
> I need to process the Report_2_5 and extract the following output
>
> Stat1: 40220
> Vol2 : 83000
> Amount3 : 40
>
> I am new to programming and I have been extracting this data using MS – Access and I badly need a better solution.

Have you considered using a proper SQL database? (See
http://en.wikipedia.org/wiki/SQL ; MySQL is one example:
http://en.wikipedia.org/wiki/MySQL)
Mucking around with CSV files like this is basically doing the work of
some simple SQL queries, only in an ad-hoc, inefficient manner. (MS
Access is essentially a non-industrial-strength SQL for
non-programmers.)

Cheers,
Chris
--
http://blog.rebertia.com
--
http://mail.python.org/mailman/listinfo/python-list

Parent Message unknown Re: python compare and process a csv file

by Peter Otten :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Chris Rebert wrote:

> On Tue, Nov 3, 2009 at 7:43 AM, Siva Subramanian
> <elpostssv@...> wrote:
>> Hello all,
>>
>> I am new on this list and computer programming
>>
>> I have two distinct statistical files (both csv)
>>
>> 1.       Report_2_5 – this is a report dump containing over a 10 million
>> records and is different every day
>>
>> 2.       Customer_id dump – this is a daily dump of customers who have
>> made payments. This is generally a million record
>>
>> I need to extract past history depending on customers who make regular
>> payments
>>
>> For example,
>>
>> Report_2_5
>>
>> Customer ID, Plan_NO, stat1, vol2, amount3
>> 2134, Ins1, 10000, 20000, 10
>> 2112, Ins3, 30000, 20000, 10
>> 2121, Ins3, 30000, 20000, 10
>> 2145, Ins2, 15000, 10000, 5
>> 2245, Ins2, 15000, 10000, 5
>> 0987, Ins1, 10000, 20000, 10
>>
>> 4546, Ins1, 10020, 21000, 10
>>
>> 6757, Ins1, 10200, 22000, 10
>> …
>>
>> customer_id dump
>>
>>
>> 0987
>>
>> 4546
>>
>> 6757
>>
>> 2134
>>
>> I need to process the Report_2_5 and extract the following output
>>
>> Stat1: 40220
>> Vol2 : 83000
>> Amount3 : 40
>>
>> I am new to programming and I have been extracting this data using MS –
>> Access and I badly need a better solution.
>
> Have you considered using a proper SQL database? (See
> http://en.wikipedia.org/wiki/SQL ; MySQL is one example:
> http://en.wikipedia.org/wiki/MySQL)
> Mucking around with CSV files like this is basically doing the work of
> some simple SQL queries, only in an ad-hoc, inefficient manner. (MS
> Access is essentially a non-industrial-strength SQL for
> non-programmers.)

Industrial strength or not, Access should be capable of solving the OP's
problem. So it would be interesting what's so bad about it in this case.

Anyway, here's a database-free python solution:

import csv

REPORT = "report.csv"
CUSTOMERS = "customers.csv"

with open(CUSTOMERS) as instream:
    next(instream) # skip header

    # put customer ids into a set for fast lookup
    customer_ids = set(line.strip() for line in instream)

with open(REPORT) as instream:
    rows = csv.reader(instream)
   
    # find columns
    headers = [column.strip() for column in rows.next()]
    customer_column = headers.index("Customer ID")
    sum_over_columns = [headers.index(s) for s in "stat1 vol2
amount3".split()]

    # initialize totals
    sigma = [0] * len(headers)

    # calculate totals
    for row in rows:
        if row[customer_column] in customer_ids:
            for index in sum_over_columns:
                sigma[index] += int(row[index])
# print totals
for index in sum_over_columns:
    print "%-10s %10d" % (headers[index] + ":", sigma[index])
 
The limiting factor for this approach is the customer_ids set which at some
point may no longer fit into memory.

Peter

--
http://mail.python.org/mailman/listinfo/python-list