John Kaster

Behind the Screen

Lightning often strikes twice

with 8 comments

Hi … lightning struck twice (figuratively speaking), and this time the result is much worse. The SAME database server that failed last time has failed again, even after it passed all diagnostics, and this time the RAID drive cluster was corrupted.

So, the good news: we’ve restored a backup to another server that is also “fault tolerant”, are doing the same for a second server, and are setting up replication between them. The bad news: all database backups since Feb 16th, 2007 appear to be corrupted, and unfortunately the only way to confirm this is to attempt to restore each one and wait about half an hour for the error to occur. So, we may be “reset” again back to Feb 16th.

So … what I really need from people is a listing of articles (you may have them in your RSS or Atom cache) that include:

  • ID for each article
  • Title
  • Abstract (summary)
  • Author name
  • and Language (if other than English)

    I need this for all articles since Feb 15th.

    Please reply here if you have them. Thanks.
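    If your reader lets you export the cached feed as an Atom file, a short script can pull out the fields listed above. This is only a sketch: it assumes a standard Atom 1.0 document, maps "Abstract" to the Atom summary element, and uses the entry's updated timestamp as the publication date. The function name list_articles and the since parameter are illustrative, not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Standard Atom 1.0 namespace and the xml:lang attribute name as ElementTree sees it.
ATOM = "{http://www.w3.org/2005/Atom}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def list_articles(atom_xml, since="2007-02-15"):
    """Return one dict per Atom <entry> updated on or after `since` (YYYY-MM-DD).

    Assumption: timestamps are RFC 3339 strings, so their first ten
    characters compare correctly as plain date strings.
    """
    feed = ET.fromstring(atom_xml)
    feed_lang = feed.get(XML_LANG, "en")  # feed-level default language
    articles = []
    for entry in feed.findall(ATOM + "entry"):
        updated = entry.findtext(ATOM + "updated", default="")
        # Skip entries known to be older; keep entries with no date at all.
        if updated and updated[:10] < since:
            continue
        articles.append({
            "id": entry.findtext(ATOM + "id", default=""),
            "title": entry.findtext(ATOM + "title", default=""),
            "abstract": entry.findtext(ATOM + "summary", default=""),
            "author": entry.findtext(ATOM + "author/" + ATOM + "name", default=""),
            "language": entry.get(XML_LANG, feed_lang),
            "updated": updated,
        })
    return articles
```

    Calling list_articles(open("feed.xml", encoding="utf-8").read()) on an exported file (feed.xml is a placeholder name) returns a list you can paste into a reply.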

    I have the Brazilian feed. I think the only other feeds that had live articles published are:

  • English
  • Japanese
  • Spanish

    I’m going to do my best to resubmit those articles (there are about 20, I believe) in the proper order so all current links will still work. If I don’t have the original source, I’ll contact the authors to ask them to post them again.

    I apologize for the repeated problem. We have a wide variety of issues to address in a broad range of areas, and hadn’t been able to restructure this implementation yet. We should have triple redundancy in place starting tomorrow, since double redundancy has now failed twice.

    Written by John Kaster

    March 1, 2007 at 7:40 pm

    Posted in EDN

    8 Responses

    1. You can use Bloglines.com to get all of that. Just go there, set up a free account, add the feeds you want, and then at the bottom you can say Display All Items and it will go back really far.

      I can give you a partial history from some of them.

      http://dn.codegear.com/feed = http://www.bloglines.com/preview?siteid=9094555

      http://blogs.codegear.com/MainFeed.aspx = http://www.bloglines.com/preview?siteid=277971

      You can email me if you want. I will be online for the next couple of hours at least and can help with other feeds if you don’t want to plug them into Bloglines. Just so you know, you can get a much longer history if you actually add them to your subscription.

      Jim McKeeth

      March 1, 2007 at 9:21 pm

    2. So you don’t have the technology to keep your servers functioning reliably.
      And despite that, you want to sell web development software for mission-critical work to other companies?

      The mind boggles…

      T Branie

      March 2, 2007 at 2:25 am

    3. As someone who just came off of a flaky Exchange server experience, I feel your pain. There was no indication of hardware problems, but things just kept getting corrupted, and we spent weeks trying to troubleshoot it on a software level (which showed nothing wrong, but was complex enough to keep us chasing our tail). That is why I much prefer ‘blue smoke’ problems. That is, if you have a box and sparks fly and blue smoke rolls out, you know it is dead and you can do something about it. These ‘occasional corruption’ hardware problems are the worst!

      Leonard Gallion

      March 2, 2007 at 3:12 am

    4. Put fans in front of all hard disks, use SCSI disks because they are more reliable, and use registered RAM to avoid corrupted data being written to disk.

      And do not use RAID – it increases the chance of failure. Instead, automate a backup process to tape.

      Frank de Groot

      March 2, 2007 at 3:13 am

    5. Even with (supposedly) good quality hardware (major brand server, ECC RAM, SCSI drives in a RAID array, on-board hardware diagnostics and dedicated air conditioning for the server room) things can go wrong; the hardware just lessens the chances. Actually, having all of this stuff ‘proves’ in your mind that it can’t be the hardware, when sometimes it is. Even ‘enterprise’ level redundancy can bite you when it replicates bad data across to your backup server (*sigh*). I guess the only lesson is "nothing is perfect".

      Leonard Gallion

      March 2, 2007 at 3:47 am

    6. That’s why we get clients to restore their IB databases onto a workstation using a batch file at least once a week.

      Great for reporting, developing, testing, oh, and to ensure that the backups are not corrupt. It also gets a backup file off of the server.

      Jason Chapman (JAC2)

      March 2, 2007 at 4:47 am
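
      The weekly test restore described in comment 6 can be sketched as a small script. This is a minimal illustration, not the batch file the commenter uses: the function name verify_backups, the "{backup}" placeholder, and the idea of treating a zero exit code as "restore succeeded" are all assumptions, and the actual restore command for your database has to be supplied by you.

```python
import subprocess


def verify_backups(backup_paths, restore_command, timeout_seconds=3600):
    """Try to restore each backup and report which ones succeeded.

    `restore_command` is a list of arguments for your own restore tool
    (hypothetical here); every "{backup}" in it is replaced by the path
    of the backup file under test. Assumption: the tool exits with
    status 0 only when the restore completed.
    """
    results = {}
    for path in backup_paths:
        command = [part.replace("{backup}", path) for part in restore_command]
        try:
            completed = subprocess.run(
                command,
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
            results[path] = completed.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # A hung restore or a missing tool counts as a failed check.
            results[path] = False
    return results
```

      Run on a schedule against a scratch machine, a check like this reports a corrupt backup within days rather than at the moment it is needed. A zero exit code is still a weak signal, so opening the restored database and running a few queries is a sensible extra step.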

    7. Even with good backup strategies, things can go wrong. Fact of life.
      It is no reflection on CodeGear or the software they sell. At least they’re open enough to admit it. I’m sure lots of big companies have had similar problems (or worse) but we never hear about it. I think it speaks volumes for CodeGear’s commitment to its customers.

      Greeny

      March 2, 2007 at 9:15 am

    8. Was this an InterBase database?

      Do you need some recommendations for reliable backups?

      Mark Hutchinson

      March 2, 2007 at 9:25 am

