
You say Lemon, I say Lemonade (A story)

The past few weeks were not all that great: in addition to facing extra challenges at my primary day job, I also had to deal with my pet project at that same day job, which exists to help smooth out my day job's activities.

Some of you may know that my pet project involves pulling gobs of data into a PG instance to make my own version of a company datamart. I'm not talking about small gobs of data, but more in the range of 200+GB. (It was more, but in one of my efforts to control/tune the server, I deleted close to 2-3 months' worth of data.)

200+GB may not seem like much to those of you who get to play with some real iron or some "real" server hardware. All I had was a Celeron 1.7G with 768MB of RAM and some gobs of IDE 7200 RPM drives. In short, all I had was lemons and I needed to make the best of it!

Actually, all was working fine and dandy up until I decided to make a slave server using Slony-I + PGpool. While that was a good decision, the hardware involved was the same if not worse (only 512MB of RAM). When I started to implement it, I was faced with two issues.

1. Replication would lag behind by up to a day or so, because the next sync (the dreaded `fetch 100 from log`) was taking too long.
2. My nightly vacuum job went from an average of 4+ hours to 27+ hours.

So, in an effort to get things under control, I went down a few paths and hit more than my share of stumbling blocks. One of the things I tried was reducing the amount of "current" data in a particular table from 1 month to 2 weeks to 1 week, moving the older rows into a so-called archive table (but still in the same tablespace). This didn't really bode well: I initially tried to move the data in 3-hour chunks, which failed, then in 1-hour chunks, and finally in 15-minute chunks.
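For those curious, the chunked move amounts to repeating something like the following for each time window. This is just a sketch; the table and column names (`big_fact_table`, `big_fact_archive`, `ts`) are hypothetical stand-ins for my actual schema:

```sql
-- Move one 15-minute window of rows into the archive table,
-- then delete them from the "current" table.
-- (big_fact_table / big_fact_archive / ts are illustrative names)
BEGIN;
INSERT INTO big_fact_archive
    SELECT * FROM big_fact_table
    WHERE ts >= '2009-04-01 00:00' AND ts < '2009-04-01 00:15';
DELETE FROM big_fact_table
    WHERE ts >= '2009-04-01 00:00' AND ts < '2009-04-01 00:15';
COMMIT;
```

Note that every row moved this way is both written once (the insert) and deleted once, so you pay roughly double the IO for the same data.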

But in the end it was all really futile, because what I was essentially doing was just generating more and more IO activity (and that's not a good thing). On top of that, I also had to vacuum the tables afterwards, since PG's MVCC leaves dead tuples behind after deletes, and that was also not fun.

So, in the end, I broke my 3x500GB RAID 1 mirror (1 spare disk) and used the spare as the Slony-I log partition. Initially, that wasn't all I did: I also moved the 2 main problematic tables from the main RAID 1 tablespace into that single-disk tablespace. That was also a mistake, and it didn't help at all. IO activity was still high, and I wasn't able to fix my vacuuming process as well.
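For reference, moving a table to a dedicated disk boils down to something like this (the tablespace name, path, and table name below are made-up examples, not my actual setup):

```sql
-- Create a tablespace on the spare single disk
-- (name and path here are hypothetical)
CREATE TABLESPACE spare_disk LOCATION '/mnt/spare/pgdata';

-- Move a table onto it. Note this rewrites the entire table
-- and holds an exclusive lock for the duration, which on a
-- table of this size is itself a big chunk of IO.
ALTER TABLE big_fact_table SET TABLESPACE spare_disk;
```

Part of why this was a mistake: the single disk now had to absorb all the read/write traffic for the hottest tables, with no mirror to spread the reads.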

Time for another plan.

This time around, what I did was move the 2 big tables back into the RAID 1 tablespace and leave the Slony logs on the single disk. In addition, I also made a few alterations to the manner in which I pull data from the main MSSQL database and the way it gets inserted into PG.

This time around, I'm utilising partitioning and some additional pgagent rules to automatically switch to a new table every 7 days; in doing so, I also had to change a few other items to get things working smoothly. I did this last Friday and, based on the emailed logs, I think I've made a good decision: right now everything seems peachy, with the vacuum back to ~4 hours and no lag in the Slony replication.
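On the PG versions of that era, this kind of weekly switch is built on table inheritance plus a rewrite rule, roughly like the sketch below, which the scheduled pgagent job would re-run every 7 days with fresh dates. The names and the date bounds are illustrative only:

```sql
-- Create next week's child table under the parent
-- (fact_parent / fact_2009_w17 / ts are hypothetical names)
CREATE TABLE fact_2009_w17 (
    CHECK (ts >= DATE '2009-04-19' AND ts < DATE '2009-04-26')
) INHERITS (fact_parent);

-- Repoint inserts on the parent at the current week's child
CREATE OR REPLACE RULE fact_insert_current AS
    ON INSERT TO fact_parent
    DO INSTEAD INSERT INTO fact_2009_w17 VALUES (NEW.*);
```

The payoff is that vacuum only has to chew on the small, recently-written child tables while the old weeks sit static, which lines up with the vacuum dropping back to ~4 hours.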

I still have another thing to do, which is to alter the script I use to pull from the main DB, as I'm being kicked (requested) to pull from an alternate DB which has a slightly different architecture.

A 2-disk RAID 1 is definitely MUCH better than a single-disk tablespace. With the amount of read/write activity that I have, a single disk is just not doable.

So, that’s how I made lemonade with my lemons. (hmm.. does this sound right?)

