The power button fiasco

So this Sunday when I woke up and tried to check my work email I found out I could not. In fact, I couldn’t hit work’s websites, VPN, SSH machines, or anything. When I finally arrived at work (it took me some time to get there; I was not at home) I saw that my boss John was already there and had managed to get the internet up and running. Half the machines were still off and John told me “it looks like the room lost power”.

So I started looking at things. Many of the machines shat themselves when coming back up because they depend on other servers that hadn’t started yet. The other half were rather confused… they are all set to “Last State” in the BIOS for what to do on power restore. It seems many of the machines couldn’t remember what state they were in. I should change those to just “Power on”.
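One way to soften the start-order problem is to have a dependent machine wait until the server it needs is actually answering before it starts its own services. Here is a minimal sketch in Python, assuming the dependency can be detected by a TCP port accepting connections. The function name and the defaults are mine for illustration, not something we run in production.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the service is at least listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable or unresolvable: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A boot script could call something like wait_for_port("fileserver.example", 2049) (a made-up host name and the usual NFS port) and only start the dependent services once it returns True.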

So John investigated the UPS while I was getting things up and running again. In the logs was a brief switch to battery followed shortly after by a dead battery. The next log item was “8:15am: Power off by front panel”. Wait, what?? Yeah. Someone pushed the shiny power button on the UPS and confirmed they wanted to shut it off. They then turned it back on and didn’t tell anyone what they had done (probably for fear of their job). Still in shock that someone would do that, I blurted out “Maybe having security silence alarms in the server room isn’t the best idea”.

The end result: a 2 hour support call for the phone system, 8 hours of my time, 4 hours of John’s time, a dead hard drive, and many upset researchers. All because at some point someone along the line decided it was a good idea to have unqualified people pushing buttons in the server room rather than just having them report the issue and let it beep. Thanks a lot! Have you learned your lesson yet?

Technology
Work


Proud

So I went to Toronto’s pride parade today. I was glad to see contingents from more-or-less all parties in the parade: Green, NDP, Liberal, and PC. I almost had a chance to have a picture taken with Jack Layton. Just before I got to the front of the line he decided he needed to go for a beer. Unfortunate for me, but he should be allowed. He is only human too.

My pride and joy today was meeting Elizabeth May and shaking her hand. After I told her how glad I was that she was at Toronto pride and in the parade, she exclaimed how she loved pride week and has made quite an effort to be here the last number of years. The gentleman she was with was kind enough to give me the Green Party t-shirt he wore in the parade.

Thank you Elizabeth May, thank you Jack Layton! Both your parties are the essence of what I believe in and it is a hard decision come election time.

Politics


Transit passes for rewards points

Although the number of points required to get a transit pass may be a little absurd, I think the city of Edmonton is on to something with selling bus passes for AirMiles. I have an AirMiles card and have often considered the points gained from it “worthless points” because I don’t travel unless someone else (work) is paying for it. If I can get a transit pass, even if it’s only once in a while, for what is essentially free, it might convince me to change my habits.

Since I bought a car a couple years ago I find I don’t take the bus anymore. It’s a hassle to wait up to an hour for the scheduled bus to come. The buses are never on time so I have to go to the stop 10 minutes early to find the bus is running 5 minutes late. Then I get weaved through parts of the city I didn’t even know existed to get to the mall or the store or wherever I’m going. Wash, rinse, repeat for any transfers I have to make and the ride home. On top of it all, I have to pay for the ride. So why would I do this when I can just hop in my gas-guzzling car and be home and done 15 minutes later? Well, I would want to do it to help save the environment. If passes were “free” it would make taking the bus more appealing to me. I imagine there are many others who are in my situation.

The problem with Edmonton’s AirMiles system is that it takes sooo many points to get a pass. I might be able to get one every year or two. They should push harder to get the number of points required down. They should petition credit card companies that have points systems to offer the same deal. Maybe I get a pass from AirMiles this month, and one from my VISA credit card the next month.

Environment


Sun Storage 7410 update

So we just learned a good, hard lesson with Sun Microsystems. As I have mentioned before, Perimeter was hitting quite the annoying bug with the 7410 that would cause Microsoft Office documents to take up to 45 seconds to load, save, autosave, etc. This issue was big enough that mid-April we had to assure our COO that someone (uhh, Sun’s employees are someone) was working around the clock until it was fixed. When we reported the issue to Sun we were assured by their support team that the next quarterly release would include a fix.

On April 27, the 2009 Q2 update was released. I diligently scheduled an upgrade as soon as possible. I installed the upgrade on Saturday May 2 and all seemed fine. When I arrived at work on Monday, I got some complaints about Microsoft Outlook and Adobe InDesign randomly crashing. After some investigation, it turned out the issue involved files open on the CIFS share from the Sun Storage. In all cases the error log held “delayed write failed” messages. More investigation showed that CIFS kept restarting itself. Well, if the CIFS sessions are ending with a CIFS service crash, that would explain the delayed write failed issues.

So again, off to contact Sun technical support. They walk me through submitting a support bundle and let me know they will get back to me. Ok, fine. Until May 5 rolls around and the storage system panics in the middle of the day. Yay, more CIFS bugs: “reboot after panic: mutex_enter: bad mutex”.

Support finally gets back to me the next day. “There are so many bugs related to CIFS in this release that it is impossible to determine which ones you are hitting. The Q2 release is an entire new release of Solaris, so it’s not possible to backport any of them to the previous release. The developers are working on a fix. Since it’s currently in development we can’t predict a release date.” After this I had to work quite hard on trying to get a recommendation on what to do out of the support technician. Finally, he recommended that we revert to the latest 2008 Q4 release.

Ok! Let’s revert… but, wait. That would have to be scheduled. So I ask management if I can revert that night after hours so that we don’t chance losing any data. I’m told I’m not allowed to and that the Microsoft Office issue was big enough that unless the system panics again I’m not to do anything to it. Risk of losing data is less important than having to wait a few seconds for a file to open.

Fast forward to May 12. Luckily no more system panics yet. Sun releases patch 1 to their 2009 Q2 release. I schedule the install for the next evening. We have now been running with the 2009.Q2.1 release since May 13 without any issues. Yay!

So, my point and hard-earned experience: don’t upgrade to the first revision of a quarterly release from Sun!

Technology
Work


Noisy wind turbines

CBC has an article up today about wind turbines causing health problems from the noise they make. Lots of residents who live near wind turbines are coming together to claim that they are having ill health effects when they are near the wind turbines in their area.

Although I have visited the Huron Wind Farm near Kincardine a number of times, I have never really been near any wind turbine that has been moving at very much speed. So I am not going to vouch for how noisy they are. I find it interesting though that Huron Wind claims in their fact file that

<< The noise produced by the Vestas wind turbines selected by Huron Wind is about 43
decibels at a distance of 250 metres. This is about the same as normal conversation.
Ambient noise in this area is normally above 50 decibels because of nearby industry. >>

This is very interesting. Ashbee and Lormand in the CBC article are claiming the turbines create a loud whooshing noise that can be heard throughout their house, which is 450 meters from the nearest turbine. Not something you would expect from something that should be at normal conversation level 250 meters away. This leads me to believe that either whoever installed those wind turbines installed the cheaper, noisier, crappier models, or Ashbee and Lormand have super hearing, or Vestas lies about the noise levels produced.
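As a rough sanity check, sound from a point source in open air drops by about 6 dB for every doubling of distance (spherical spreading). The sketch below applies that rule to the Huron Wind figure. It assumes a single turbine treated as a point source, with no wind, terrain or reflection effects, so it is a ballpark estimate and not a noise model.

```python
import math


def level_at_distance(level_db, ref_distance_m, distance_m):
    """Estimate sound level at distance_m using free-field spherical spreading."""
    return level_db - 20.0 * math.log10(distance_m / ref_distance_m)


# Huron Wind's figure: 43 dB at 250 m; the house in the article is 450 m away.
estimate = level_at_distance(43.0, 250.0, 450.0)
print(f"{estimate:.1f} dB")  # prints "37.9 dB"
```

By this estimate the level at the house would be roughly 38 dB, lower than the quoted 43 dB, which makes the complaints even harder to square with the fact file.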

I have read before that wind farms are generally intended to live slightly off-shore. There are no trees to worry about, no bats to kill, and no houses to disturb there. Perhaps the makers of some turbine models don’t bother making them quiet because this is where they expect them to be deployed.

Rather than banish the idea of wind farms as the article seems to be implying, a better idea may be to study, with actual decibel meters and double-blind studies and all, the effects of wind turbine noise from different manufacturers and models of wind turbines. Following the studies, government should implement regulations on turbines within a certain distance of residential areas. New installations can adhere to these standards and old installations can be retrofitted.

Wind energy is nothing new. Windmills have existed for a very long time. Local farms have used small windmills to generate electricity for the farm for many years. What is new is using wind energy on a large scale. Proper research and regulations need to be in effect to ensure the safety of everyone around the large-scale installations.

Environment
Off-the-grid


Sun Unified Storage 7410 put into production

Despite several speed bumps, we ended up deciding to keep the Sun storage system at PI. Here is what we found.

Our configuration:

  • 7410 Storage Controller with 16GB RAM, 2 x 2.3GHz Quad-Core processors and built-in 4 x 1Gb ethernet ports
  • 22 x 1TB SATA drives, Double parity RAID (14TB formatted)
  • 2 x 18GB Solid State write disks, RAID1
  • 100GB Solid State read disk

This does not provide redundant controllers. The system can be made fully redundant by doubling the price and clustering multiple controllers and J4400s.

Exported, we have:

  • /export/home – NFSv3 and CIFS home directories
  • /export/vmware – NFSv3 for VMWare (currently 6 VMs)
  • exchangedb iSCSI lun – for Exchange database files
  • exchangelogs iSCSI lun – for Exchange transaction logs

Configuration issues:

  • The Sun tech who installed the system suggested dedicating one of the ethernet interfaces to the administrative BUI, as the analytics tool uses a lot of bandwidth. This proves to be useful only if you are hitting the administrative interface from the same network as the Sun system is on. Otherwise it sends all traffic out the interface configured for your default route, which is likely tied to your data interface. In our case we would be hitting the admin interface more-or-less always from a different network, making their suggestion a waste of an interface.
  • We wanted to do a combination of IPMP and LACP. This would allow us to aggregate two interfaces each to two different network switches. If one switch died the 7410 would fail over to using the other aggregate. The problem with this is there was no option in the BUI to configure failback to a preferred aggregate in the IPMP options. As we were using iSCSI for email and NFS for VMWare with the 7410 we would prefer to use a specific switch. Because of this we ended up using LACP to aggregate all four interfaces. This makes the system not resilient to switch stack failure, but significantly reduces the chance of overloading cross-switch traffic.
  • We found that whenever we reconfigured the network on the 7410 the system needed a reboot before it was accessible again. This isn’t an issue as network settings are a set-once thing for us, but can be a pain for some people.
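The first point comes down to ordinary routing: a reply leaves through the interface attached to the client’s subnet if there is one, and through the default route otherwise. Here is a minimal sketch of that decision, with made-up interface names and addresses. None of these are our real networks, and this is only an illustration of the behaviour we saw, not how the 7410 implements it internally.

```python
import ipaddress

# Hypothetical layout: a dedicated admin interface and a data interface
# that also carries the default route.
INTERFACES = {
    "admin0": ipaddress.ip_network("10.0.10.0/24"),
    "data0": ipaddress.ip_network("10.0.20.0/24"),
}
DEFAULT_ROUTE_INTERFACE = "data0"


def reply_interface(client_ip):
    """Pick the interface a reply to client_ip would leave through."""
    addr = ipaddress.ip_address(client_ip)
    for name, network in INTERFACES.items():
        if addr in network:
            return name  # client is on a directly connected network
    return DEFAULT_ROUTE_INTERFACE  # everything else follows the default route


print(reply_interface("10.0.10.55"))   # client on the admin network -> admin0
print(reply_interface("192.168.5.7"))  # client elsewhere -> data0
```

An admin workstation on any other network lands on the data interface, which is why the dedicated admin port bought us nothing.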

Performance:

  • We did a JetStress test of a 173GB test database on an iSCSI lun from the system. We achieved 750 IOPS. This was while the system was otherwise idle. Our NetApp FAS270 topped out at ~250 IOPS from the SATA shelf and ~400 IOPS from the local FC disk.
  • We ran iozone on the system. I am not very good at interpreting the results or at making graphs of them. It looks like we saw on average 80MB/s transfer for the test.
  • We see on average 130 IOPS use of the system, which is more than sufficient for us.
  • Performance can still be increased by adding up to 5 more (albeit expensive) 100GB Solid State read disks.
  • Aside from the Microsoft Office bug (see below) we have not heard complaints about performance in two weeks.
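To put the numbers above side by side, here is the arithmetic as a small Python snippet. The figures are the ones quoted in the list. The comparison is loose, since a JetStress run and our average production load are not the same kind of workload.

```python
def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


# Figures quoted in the list above.
sun_7410_iops = 750
netapp_sata_iops = 250
netapp_fc_iops = 400
average_load_iops = 130

print(f"vs NetApp SATA shelf: {ratio(sun_7410_iops, netapp_sata_iops):.1f}x")
print(f"vs NetApp FC disk: {ratio(sun_7410_iops, netapp_fc_iops):.1f}x")
print(f"headroom over average load: {ratio(sun_7410_iops, average_load_iops):.1f}x")
```

That works out to 3.0x the SATA shelf, about 1.9x the FC disk, and roughly 5.8x our average load.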

Backups:

  • We were able to successfully do a full backup of the NFS/CIFS fileshares from the 7410 using NDMP. The fileshares are not browsable via NDMP, so you must tell your NDMP client the full path to what you want to back up.
  • Incremental or differential backups via NDMP are still a mystery. I need to open a support case for this.
  • iSCSI luns are not yet able to be backed up via NDMP. We are doing host-level Exchange backups for this. There is no “SnapManager for Exchange”-like tool as NetApp had. Sun claims NDMP backups of iSCSI luns are on the development board for later this year.

Other issues:

  • There is a bug for file locking on shares that are exported both NFS and CIFS. When opening Microsoft Office documents from the share there is a 15-30 second wait while a file lock is acquired. This has been explained by Sun tech support as being the result of having to delve down to the individual SATA disk for the file lock. They are implementing a fix for the next release.
  • We created the fileshares with Reject non-UTF8 filenames. This is a default setting and not changeable once the share is created. This caused issues copying files while using Linux from the NetApp to the 7410. The NetApp had some latin1 encoded filenames that would not copy. We were 6 hours into the data move when the issue showed itself. We found a work-around to use CIFS to copy these specific files.
  • File permissions mapping between CIFS and NFS is just as bad as using NetApp in Mixed mode. This is due to POSIX file permissions being inherently incompatible with ACLs. After a lot of work one can massage the permissions to work properly, but it’s mind boggling madness. I understand this is not so big an issue with NFS4 which uses ACLs by default, but we are stuck in NFS3 world.
  • Authenticated NFS (e.g. Kerberos) does not exist yet. Apparently this is in a future release.
  • User quotas don’t exist. There is a workaround to create a separate share for each user. This is unmanageable in my opinion. It’s a good thing Perimeter’s administration thinks it’s too draconian to implement quotas.
  • Snapshots are not named in a user-intelligible way. They show up with the unix timestamp (e.g. .auto-1238601600). They are only accessible from the root of the directory structure (e.g. \\solar\home\.zfs\snapshot). The Windows Previous Versions tab does not show up. Thus, snapshots are a little less user-friendly, however they are still there and quite usable.
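Two of these items lend themselves to small helper scripts. The sketch below is something I would try, not a tested tool. The first function scans a tree for names that are not valid UTF-8, which would have caught our latin1 files before the data move started. The second turns a snapshot name into a readable date, assuming the number after the dash is seconds since the unix epoch in UTC. Both function names are my own.

```python
import os
from datetime import datetime, timezone


def find_non_utf8_names(root):
    """Yield paths (as bytes) under root whose name is not valid UTF-8."""
    # Walking with a bytes path makes os.walk return raw bytes names,
    # so nothing gets decoded behind our back.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                yield os.path.join(dirpath, name)


def snapshot_time(snapshot_name):
    """Convert a name like '.auto-1238601600' to a UTC datetime."""
    timestamp = int(snapshot_name.rsplit("-", 1)[1])
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


print(snapshot_time(".auto-1238601600").isoformat())
# prints 2009-04-01T16:00:00+00:00
```

Running find_non_utf8_names over the source filesystem from a Linux host before the copy would give a list of files to rename or to move over CIFS instead.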

All in all, I like the system. It is significantly cheaper than anything else, which was a major factor in deciding to keep the unit, and it gives as much expandability as the NetApp 3410 would have. The unit comes with free software upgrades and, for a very cheap price (considering the cost of the unit), was purchased with a 3 year hardware warranty.

Technology
Work


Earth hour – tonight

Environment


Pictures of Perimeter’s Sun Storage 7410

I took some pictures this morning of the new storage system.

Technology
Work


New toy – Sun Unified Storage 7410

Perimeter recently bought a 22TB single-controller model of Sun’s new 7410 storage system to replace an aging NetApp that was being used for email, VMWare, home directories and other network storage.

The 7410 promises similar performance and features to the NetApp model line we were looking at as a replacement. The 7410 uses mostly SATA disk, with some SSD drives to take advantage of the ZFS cache features. This has been promoted by Sun as providing quite significant performance enhancements. The main selling factor for us was that the Sun 7410 comes in at a fraction of the price with 4 times the storage of the NetApp product, as there are no software licenses to purchase and the hardware itself is cheaper.

So I get the joys and fun of devoting all my time over the next week to “test” the 7410. It’s obvious it fits our requirements for our environment… it’s got over double the storage capacity we need and is at least twice as fast as the NetApp we are replacing even before we add the SSD cache in.

The one issue I have with the particular model we bought is that it is not fully redundant. It could be if we spent twice as much money on it. All one needs to do is purchase a second controller and a second J4400, connect them together and voila, fully redundant. However, our server room racks are full, our UPS is full, it wasn’t budgeted for, and management doesn’t see this as an issue. Full IT outages as we upgrade firmware and OS on storage and network devices are common at Perimeter Institute. The NetApp solutions we were looking at were not fully redundant either.

I have been trying to figure out what use 14TB of formatted disk is when you only have it for a week. Although JetStress and IOZone can be run for a very long time, I only really need to run them for a day for my purposes. I suppose I can start migrating data such as VMWare images to the 7410 early and see if anyone notices, but it just seems like there should be something more fun for me to do with it in the meantime.

Technology
Work


XenServer is now free

This announcement is awesome; however, I will probably not get to play around with it at all. We currently under-use VMWare at work so there aren’t any virtualization projects in the near future. Still, it’s really nice to have the ability to XenMotion for free.

Technology
Work
