High Frequency Inserts to Postgres Database vs Writing to a File

View: New views
10 Messages — Rating Filter:   Alert me  

Parent Message unknown High Frequency Inserts to Postgres Database vs Writing to a File

by Jay Manni :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Some parts of this message have been removed. Learn more about Nabble's security policy.

Hi:

 

I have an application wherein a process needs to read data from a stream and store the records for further analysis and reporting. The data in the stream is in the form of variable length records with clearly defined fields – so it can be stored in a database or in a file. The only caveat is that the rate of records coming in the stream could be several 1000 records a second.

 

The design choice I am faced with currently is whether to use a postgres database or a flat file for this purpose. My application already maintains a postgres (8.3.4) database for other reasons – so it seemed like the straightforward thing to do. However I am concerned about the performance overhead of writing several 1000 records a second to the database. The same database is being used simultaneously for other activities as well and I do not want those to be adversely affected by this operation (especially the query times). The advantage of running complex queries to mine the data in various different ways is very appealing but the performance concerns are making me wonder if just using a flat file to store the data would be a better approach.

 

Anybody have any experience in high frequency writes to a postgres database?

 

- Jay


Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Scott Marlowe-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Nov 3, 2009 at 8:12 PM, Jay Manni <JManni@...> wrote:

> Hi:
>
>
>
> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting. The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.
>
>
>
> The design choice I am faced with currently is whether to use a postgres
> database or a flat file for this purpose. My application already maintains a

A common approach is to store them in flat files, then insert the flat
files at a later time so that if the db falls behind no data is lost.

--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: High Frequency Inserts to Postgres Database vs Writing to a File

by David Saracini-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Some parts of this message have been removed. Learn more about Nabble's security policy.
"could be several 1000 records a second."

So, are there periods when there are no/few records coming in?  Do the records/data/files really need to be persisted?  

The following statement makes me think you should go the flat file route:

"The advantage of running complex queries to mine the data in various different ways is very appealing"

Please don't be offended, but that sounds a little like feature creep.  I've found that it's best to keep it simple and don't do a bunch of work now for what might be requested in the future.

I know it's not exactly what you were looking for...  Just food for thought.

Best of luck!

David


From: Jay Manni <JManni@...>
To: "pgsql-performance@..." <pgsql-performance@...>
Sent: Tue, November 3, 2009 7:12:29 PM
Subject: [PERFORM] High Frequency Inserts to Postgres Database vs Writing to a File

Hi:

 

I have an application wherein a process needs to read data from a stream and store the records for further analysis and reporting. The data in the stream is in the form of variable length records with clearly defined fields – so it can be stored in a database or in a file. The only caveat is that the rate of records coming in the stream could be several 1000 records a second.

 

The design choice I am faced with currently is whether to use a postgres database or a flat file for this purpose. My application already maintains a postgres (8.3.4) database for other reasons – so it seemed like the straightforward thing to do. However I am concerned about the performance overhead of writing several 1000 records a second to the database. The same database is being used simultaneously for other activities as well and I do not want those to be adversely affected by this operation (especially the query times). The advantage of running complex queries to mine the data in various different ways is very appealing but the performance concerns are making me wonder if just using a flat file to store the data would be a better approach.

 

Anybody have any experience in high frequency writes to a postgres database?

 

- Jay


Parent Message unknown Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Greg Smith-22 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Jay Manni wrote:
> The data in the stream is in the form of variable length records with
> clearly defined fields ? so it can be stored in a database or in a
> file. The only caveat is that the rate of records coming in the stream
> could be several 1000 records a second.
There's a few limits to be concerned about here, some physical, some
related to your application design.  A few thousand records is possible
with either the right software design or some hardware assist.  The wall
where it gets increasingly difficult to keep up is closer to 10K/second,
presuming your records aren't particularly wide.

Some background:  when you commit a record in PostgreSQL, by default
that transaction doesn't complete until data has been physically written
to disk.  If you take a typical 7200 RPM disk, that spins 120
times/second, meaning that even under the best possible conditions there
can only be 120 commits per second to physical disk.

However, that physical commit can contain more than one record.  Here
are the common ways to increase the number of records you can insert per
second:

1) Batch up inserts.  Turn off any auto-commit behavior in your client,
insert a bunch of records, issue one COMMIT.
Typical popular batch sizes are in the 100-1000 records/commit range.  
If individual records aren't very wide, you can easily get a huge
speedup here.  Hard to estimate how much this will help in your case
without knowing more about that width and the speed of your underlying
disks; more on that below.

2) Have multiple clients committing at once.  Typically I see this give
at most about a 5X speedup, so on a slow disk with single record commits
you might hit 600/s instead of 120/s if you had 10 clients going at once.

3) Use a RAID controller with a battery-backed cache.  This will hold
multiple disk commits in its cache and dump them onto disk in larger
chunks transparently, with only a small risk of corruption if there's an
extended power outage longer than the battery lasts.  Typically I'll see
this increase commit rate to the 1000-10,000 commits/second range, again
depending on underlying disk speed and row size.  This approach really
reduces the worst-case behavior disks can get into, which is where you
keep seeking between two spots writing small bits at each one.

4) Turn off synchronous_commit.  This lets you adjust the rate at which
records get committed into larger chunks without touching your
application or hardware.  It does introduce the possibility you might
lose some records if there's a crash in the middle of loading or
changing things.  Adjusting the commit period here upwards makes this
case look similar to (1), you're basically committing in larger chunks
but the app just doesn't know it.

Basically, (2) alone is probably not enough to reach 1,000 per second.  
But (1) or (3) is, as is (4) if you can take the potential data
integrity issues if there's a crash.  If your batchs get bigger via any
of these techniques, what should end up happening is that you push the
bottleneck to somewhere else, like disk write or seek speed.  Which of
those you'll run into depends on how interleaved these writes are with
application reads and the total disk bandwidth.

To close, here's a quick example showing the sort of analysis you should
be doing to better estimate here.  Imagine you're trying to write 10,000
records/second.  Each record is 100 bytes wide.  That works out to be
almost 1MB/s of steady disk writes.  In the real world, a single cheap
disk can't do much better than this if those writes involve heavy
seeking around the disk.  And that's happens in a database, because at a
minimum you need to write to both the main database area and the
write-ahead log.  If your records are 1,000 records wide instead, you
might hit the upper limit of your disk seeking capability at only
1,000/second.

Whereas if you have an app that's just writing to a file, you wouldn't
necessarily expect that to regularly seek elsewhere.  That means it's
much likely that you'd hit >10MB/s on writes rather than the 1-2MB/s
worst-case behavior when seeking.  Of course, as you've already noted,
you end up paying penalties on reporting instead if you do that.  The
best you can do here is to try and measure your application and
estimate/simulate larger volume, then see what happens if you apply one
or more of these techniques.

--
Greg Smith        greg@...        Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Merlin Moncure-2 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

n Tue, Nov 3, 2009 at 10:12 PM, Jay Manni <JManni@...> wrote:
> Hi:
>
> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting. The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.

Postgres doing this is going to depend on primarily two things:
*) Your hardware
*) The mechanism you use to insert the data into the database

Postgres can handle multiple 1000 insert/sec but your hardware most
likely can't handle multiple 1000 transaction/sec if fsync is on.  You
definitely want to batch the insert into the database somehow, so that
something accumulates the data (could be a simple file), and flushes
it in to the database.   The 'flush' ideally should use copy but
multiple row insert is ok too.  Try to avoid inserting one row at a
time even if in a transaction.

If you are bulk inserting 1000+ records/sec all day long, make sure
you have provisioned enough storage for this (that's 86M records/day),
and you should immediately start thinking about partitioning and
rotating the log table (if you log to the database, partition/rotate
is basically already baked in anyways).

The effects on other users of the database are really hard to predict
-- it's going to depend on how much resources you have (cpu and
especially disk) to direct towards the loading and how the database is
being used.  I expect it shouldn't be too bad unless your dataase is
already i/o loaded.  The good news is testing this is relatively easy
you can simulate a load test and just run it during typical use and
see how it affects other users.  Standard o/s tools (iostat, top), and
database log with min_duration_statement are going to be a big help
here.   If you start seeing big leaps in iowait corresponding with
unexpectedly lagging queries in your app , you probably should think
about scrapping the idea.

merlin

--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Parent Message unknown Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Greg Smith-21 :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Jay Manni wrote:

> The data in the stream is in the form of variable length records with
> clearly defined fields ? so it can be stored in a database or in a
> file. The only caveat is that the rate of records coming in the stream
> could be several 1000 records a second.


There's a few limits to be concerned about here, some physical, some
related to your application design. A few thousand records is possible
with either the right software design or some hardware assist. The wall
where it gets increasingly difficult to keep up is closer to 10K/second,
presuming your records aren't particularly wide.

Some background: when you commit a record in PostgreSQL, by default
that transaction doesn't complete until data has been physically written
to disk. If you take a typical 7200 RPM disk, that spins 120
times/second, meaning that even under the best possible conditions there
can only be 120 commits per second to physical disk.

However, that physical commit can contain more than one record. Here
are the common ways to increase the number of records you can insert per
second:

1) Batch up inserts. Turn off any auto-commit behavior in your client,
insert a bunch of records, issue one COMMIT.
Typical popular batch sizes are in the 100-1000 records/commit range.
If individual records aren't very wide, you can easily get a huge
speedup here. Hard to estimate how much this will help in your case
without knowing more about that width and the speed of your underlying
disks; more on that below.

2) Have multiple clients committing at once. Typically I see this give
at most about a 5X speedup, so on a slow disk with single record commits
you might hit 600/s instead of 120/s if you had 10 clients going at once.

3) Use a RAID controller with a battery-backed cache. This will hold
multiple disk commits in its cache and dump them onto disk in larger
chunks transparently, with only a small risk of corruption if there's an
extended power outage longer than the battery lasts. Typically I'll see
this increase commit rate to the 1000-10,000 commits/second range, again
depending on underlying disk speed and row size. This approach really
reduces the worst-case behavior disks can get into, which is where you
keep seeking between two spots writing small bits at each one.

4) Turn off synchronous_commit. This lets you adjust the rate at which
records get committed into larger chunks without touching your
application or hardware. It does introduce the possibility you might
lose some records if there's a crash in the middle of loading or
changing things. Adjusting the commit period here upwards makes this
case look similar to (1), you're basically committing in larger chunks
but the app just doesn't know it.

Basically, (2) alone is probably not enough to reach 1,000 per second.
But (1) or (3) is, as is (4) if you can take the potential data
integrity issues if there's a crash. If your batchs get bigger via any
of these techniques, what should end up happening is that you push the
bottleneck to somewhere else, like disk write or seek speed. Which of
those you'll run into depends on how interleaved these writes are with
application reads and the total disk bandwidth.

To close, here's a quick example showing the sort of analysis you should
be doing to better estimate here. Imagine you're trying to write 10,000
records/second. Each record is 100 bytes wide. That works out to be
almost 1MB/s of steady disk writes. In the real world, a single cheap
disk can't do much better than this if those writes involve heavy
seeking around the disk. And that's happens in a database, because at a
minimum you need to write to both the main database area and the
write-ahead log. If your records are 1,000 records wide instead, you
might hit the upper limit of your disk seeking capability at only
1,000/second.

Whereas if you have an app that's just writing to a file, you wouldn't
necessarily expect that to regularly seek elsewhere. That means it's
much likely that you'd hit >10MB/s on writes rather than the 1-2MB/s
worst-case behavior when seeking. Of course, as you've already noted,
you end up paying penalties on reporting instead if you do that. The
best you can do here is to try and measure your application and
estimate/simulate larger volume, then see what happens if you apply one
or more of these techniques.

--
Greg Smith greg@... Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Jeff Janes :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Tue, Nov 3, 2009 at 7:12 PM, Jay Manni <JManni@...> wrote:
> Hi:
>
>
>
> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting.

Where is the stream coming from?  What happens if the process reading
the stream fails but the one generating the stream keeps going?

> The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.
>
> The design choice I am faced with currently is whether to use a postgres
> database or a flat file for this purpose. My application already maintains a
> postgres (8.3.4) database for other reasons – so it seemed like the
> straightforward thing to do. However I am concerned about the performance
> overhead of writing several 1000 records a second to the database. The same
> database is being used simultaneously for other activities as well and I do
> not want those to be adversely affected by this operation (especially the
> query times).

I would not use the database, but just a flat file.  You can always load it
to a database later as long as you keep the files around, if a
compelling reason arises.

> The advantage of running complex queries to mine the data in
> various different ways is very appealing

Do you have concrete plans to do this, or just vague notions?

Even if the loading of 1000s of records per second doesn't adversely
impact the performance of other things going on in the server, surely
doing complex queries on hundreds of millions of records will.  How
long to you plan on storing the records in the database, and how to
delete them out?   Do you already know what indexes, if any, should be
on the table?

Jeff

--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Anj Adu :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

> I have an application wherein a process needs to read data from a stream and
> store the records for further analysis and reporting. The data in the stream
> is in the form of variable length records with clearly defined fields – so
> it can be stored in a database or in a file. The only caveat is that the
> rate of records coming in the stream could be several 1000 records a second.
> The design choice I am faced with currently is whether to use a postgres
> database or a flat file for this purpose. My application already maintains a
> postgres (8.3.4) database for other reasons – so it seemed like the
> straightforward thing to do. However I am concerned about the performance
> overhead of writing several 1000 records a second to the database. The same
> database is being used simultaneously for other activities as well and I do
> not want those to be adversely affected by this operation (especially the
> query times). The advantage of running complex queries to mine the data in
> various different ways is very appealing but the performance concerns are
> making me wonder if just using a flat file to store the data would be a
> better approach.
>
>
>
> Anybody have any experience in high frequency writes to a postgres database?


As mentioned earlier in this thread,,make sure your hardware can
scale. You may hit a "monolithic hardware" wall and may have to
distribute your data across multiple boxes and have your application
manage the distribution and access. A RAID 10 storage
architecture(since fast writes are critical) with a mulitple core box
(preferably 8) having fast scsi disks (15K rpm) may be a good starting
point.

We have a similar requirement and we scale by distributing the data
across multiple boxes. This is key.

If you need to run complex queries..plan on aggregation strategies
(processes that aggregate and optimize the data storage to facilitate
faster access).

Partitioning is key. You will need to purge old data at some point.
Without partitions..you will run into trouble with the time taken to
delete old data as well as availability of disk space.

These are just guidelines for a big warehouse style database.

--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Craig Ringer :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Merlin Moncure wrote:

> Postgres can handle multiple 1000 insert/sec but your hardware most
> likely can't handle multiple 1000 transaction/sec if fsync is on.

commit_delay or async commit should help a lot there.

http://www.postgresql.org/docs/8.3/static/wal-async-commit.html
http://www.postgresql.org/docs/8.3/static/runtime-config-wal.html

Please do *not* turn fsync off unless you want to lose your data.

> If you are bulk inserting 1000+ records/sec all day long, make sure
> you have provisioned enough storage for this (that's 86M records/day),

plus any index storage, room for dead tuples if you ever issue UPDATEs, etc.

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: High Frequency Inserts to Postgres Database vs Writing to a File

by Jay Manni :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Thanks to all for the responses. Based on all the recommendations, I am going to try a batched commit approach; along with data purging policies so that the data storage does not grow beyond certain thresholds.

- J

-----Original Message-----
From: Craig Ringer [mailto:craig@...]
Sent: Wednesday, November 04, 2009 5:12 PM
To: Merlin Moncure
Cc: Jay Manni; pgsql-performance@...
Subject: Re: [PERFORM] High Frequency Inserts to Postgres Database vs Writing to a File

Merlin Moncure wrote:

> Postgres can handle multiple 1000 insert/sec but your hardware most
> likely can't handle multiple 1000 transaction/sec if fsync is on.

commit_delay or async commit should help a lot there.

http://www.postgresql.org/docs/8.3/static/wal-async-commit.html
http://www.postgresql.org/docs/8.3/static/runtime-config-wal.html

Please do *not* turn fsync off unless you want to lose your data.

> If you are bulk inserting 1000+ records/sec all day long, make sure
> you have provisioned enough storage for this (that's 86M records/day),

plus any index storage, room for dead tuples if you ever issue UPDATEs, etc.

--
Craig Ringer

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


--
Sent via pgsql-performance mailing list (pgsql-performance@...)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance