Archive for October, 2012

Finding hdd serial number in Solaris

October 30, 2012 1 comment

Redeemers of this world
Dwell in hypocrisy:
“How were we supposed to know?”
(Nightwish – The Kinslayer)

ZFS is one of those technologies out there that really kicks some serious ass. In data security and storage scalability it has no match among other volume manager + filesystem combos. But hard disks, being mechanical beasts, tend to fail sooner or later. Today I got an alert from one of my systems, and this is the state I encountered:

# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
          raidz1-1  DEGRADED     0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  DEGRADED     0     0    33  too many errors

errors: No known data errors

No known data errors, even with bad blocks on one of the drives in a raidz vdev (ZFS's take on RAID5) – now how cool is that! Silent corruption isn't even on the table 🙂 OK, it's time to replace the hard drive, but how to locate it in the chassis? Even if you know the exact slot position, the serial number is always a welcome extra safety check. We don't wanna replace the wrong drive, do we? So, how can one see the serial number of a hard drive on Solaris? First try, iostat:

# iostat -E c2t5d0
 sd5       Soft Errors: 0 Hard Errors: 184 Transport Errors: 0
 Vendor: ATA      Product: ST3500630AS      Revision: C    Serial No:
 Size: 500.11GB <500107861504 bytes>
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 Illegal Request: 24 Predictive Failure Analysis: 0

Now we know the hdd model and some additional information, but there's really no way to distinguish this drive from the other five in the pool. iostat may well report the serial on SPARC systems, but on a cheap home-built storage server with SATA disks it obviously fails to deliver the needed information. So, next try: cfgadm:

# cfgadm -alv | grep SN | grep c2t5d0
 sata0/5::dsk/c2t5d0            connected    configured   ok        Mod: ST3500630AS FRev: 3.AAC SN: 4BA6G5NN

OK, now we have the serial number and can be 100% certain which drive to replace.
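From here the replacement itself is the easy part. A sketch of the remaining steps, assuming a hot-swap bay where the new disk comes up under the same device name (the commands follow the pool's own `action:` hint above; watch `zpool status` for resilver progress):

```
# zpool replace tank c2t5d0      <- after physically swapping the disk in the same slot
# zpool status tank              <- resilvering of raidz1-1 should now be in progress
# zpool clear tank               <- once resilvered, reset the old error counters
```

If the replacement disk shows up under a different device name, `zpool replace tank c2t5d0 <newdev>` takes the old and new device explicitly.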

Categories: Storage

Snort: too many open files

October 19, 2012 6 comments

Creation of insane rule
All we hear:
Desperate cry
(Sepultura – Desperate Cry)

I really hate those unproductive hours (hopefully not days) when one needs to debug some strange problems whose solution won’t be reusable. Hm? Deja-vu? 🙂 Well it hit me again. And this time it was hard.

I was trying to write some manifests and control our local Snort installation through Puppet. We use VRT and Emerging Threats rules, fetched via pulledpork. So puppetizing Snort should be a breeze. And it was… Everything went extremely well: I wrote two classes, snort and snort::pulledpork (along with a standard params class). Data is stored in Hiera; /etc/sysconfig/snort, /etc/snort/snort.conf and all of the pulledpork configs are dynamically generated from it. The world looked really nice. And I was a happy devop 😉

But the problems started later – when I actually tried to start the snort service. It was just failing miserably without any significant output. I tried ‘bash -x’ on the init script and running the command manually, but I was getting nowhere. Then I turned to syslog, where I saw a bunch of Snort startup messages and then, all of a sudden:

rsyslogd-2177: imuxsock begins to drop messages from pid 17207 due to rate-limiting

Well, a temporary fix for that issue was (note the single quotes – with double quotes the shell would expand $SystemLogRateLimitInterval to an empty string):

# echo '$SystemLogRateLimitInterval 0' > /etc/rsyslog.d/test.conf

And of course, you need to restart rsyslogd after this one… Pretty strange that the default syslog in CentOS 6 is so touchy about being filled up too fast… I liked the old syslogd behaviour more…
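For a less blunt permanent fix, the imuxsock rate-limit knobs can live in a proper drop-in file. The directives below exist in the rsyslog 5.x shipped with CentOS 6; the file name and example values are my own choice, not from the original setup:

```
# /etc/rsyslog.d/ratelimit.conf -- restart rsyslogd afterwards
# Either disable rate limiting entirely:
$SystemLogRateLimitInterval 0
# ...or keep it but allow a much bigger burst per interval
# (comment out the line above and uncomment these two instead):
#$SystemLogRateLimitInterval 5
#$SystemLogRateLimitBurst 2000
```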

Anyways, back to the main issue. After “fixing” rsyslogd I finally had something to work on:

FATAL ERROR: /etc/snort/rules/VRT-app-detect.rules(0)
 Unable to open rules file "/etc/snort/rules/VRT-app-detect.rules":
 Too many open files.#012

Now we’re getting there! It’s a piece of cake to solve:

# echo "ulimit -n 10240" >> /etc/sysconfig/snortd

Although it didn’t work… So off I was on a lonely path of useless debugging. Why doesn’t this work? Maybe it’s something with the system? Raising fs.file-max to absurd levels didn’t help… Maybe it has to do with the account snort runs as? Trying limits.conf didn’t work either. Now I was baffled. One thing I did notice was that after raising the ulimit on open files, snort took a lot longer to start – or should I say, to fail. Then I turned to strace: the file descriptor number in the read() calls just kept climbing until it hit the maximum. The weird thing was that it always broke on the exact same file… which dragged me away from the real problem. Since nothing had helped so far, I decided to dismantle the snort configuration and rules. After zeroing out one config file – snort started! Now we’re talking. I uncommented it line by line, and after about 3/4 of the lines, another error… And now I finally saw the culprit!!!

include $RULE_PATH/VRT.conf

Utter facepalm… I’ll leave you to guess the name of the file that contained that line…
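In hindsight, a one-liner would have caught this in seconds: a rules file whose own name appears in one of its include lines is including itself. A hypothetical sketch (the directory and file names here are made up for illustration; point it at your real $RULE_PATH instead of the temp dir):

```shell
# Build a throwaway rules dir that reproduces the bug, then scan it.
RULE_PATH=$(mktemp -d)
printf 'include $RULE_PATH/VRT.conf\n' > "$RULE_PATH/VRT.conf"    # self-include: the culprit
printf 'include $RULE_PATH/VRT.conf\n' > "$RULE_PATH/other.conf"  # same line elsewhere is harmless

# Flag any .conf file that includes a path ending in its own basename.
for f in "$RULE_PATH"/*.conf; do
    base=$(basename "$f")
    if grep -q "^include .*/$base" "$f"; then
        echo "recursive include in $f"
    fi
done
```

Running this prints a warning for VRT.conf only – other.conf references the same file but is not referencing itself, so it stays quiet.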

Problems with grub and (U)EFI

October 16, 2012 5 comments

You keep this love, thing, child, toy
You keep this love, fist, scar, break
You keep this love
(Pantera – This love)

I really hate those unproductive hours (hopefully not days) when one needs to debug some strange problems whose solution won’t be reusable. All of us have some of those. Although IBM delivers some fine hardware, sometimes strange problems arise. So this was one of those days…

We use Cobbler + PXE for unattended CentOS installations and haven’t had problems on many different machines. But the day arrived 🙂 There were two kinds of servers at our disposal: x3550 M4 1U and x3630 M4 2U with attached DAS. The first was equipped with 2x900GB SAS drives, the second with a bunch of disks 🙂 The x3550 M4 was set up in a RAID0 (striping) configuration, and with stock BIOS settings the PXE installation went like a charm. On the other hand, the two system drives (another pair of 900GB SAS) on the x3630 M4 were set up as a mirror; installation went smoothly, but grub refused to start. The only output we got was a blinking cursor in the top left corner of an empty black screen. And it’s kinda hard to google for a solution to that error 😉

CentOS 6 uses the old Grub 0.97, but that was not the issue. Booting a livecd, manually forcing grub installs, installing to the MBR of other block devices present on the system – nothing worked… After two days of pain, the solution finally came around the block… Two simple steps were needed to fix it forever:

  • Enter the BIOS
  • Load Default Settings
  • System Settings => Legacy Support
  • Disable BBS Boot
  • Back to the main menu (ESC)
  • Save Settings
  • Exit Setup

We tried dozens of other combinations, but the final solution was to 1) load defaults and 2) disable BBS Boot. We haven’t tried booting from a GPT partition table, so maybe that would work too. But why bother, if the virtual block device is 900GB – an msdos partition table will suffice. So if anyone comes across this kind of problem – grub not loading from an msdos partition table on a system that also has GPT partition tables on another set of block devices, with or without the EFI/UEFI setting in the BIOS – hope this blog entry saves you some time: disable the Bunch of Bull Shit BIOS option.
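For anyone retracing the debugging steps: the manual reinstall mentioned above (which by itself did not cure the blinking cursor) was the standard grub 0.97 dance from the livecd. A sketch, assuming /boot lives on the first partition of the first BIOS disk – adjust (hd0,0) to your layout:

```
# grub --no-floppy
grub> root (hd0,0)     <- partition holding /boot
grub> setup (hd0)      <- write stage1 to the MBR of the first disk
grub> quit
```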

As someone over at Pantera’s video wrote:

You know the world sucks when you search “This Love” and the results are maroon 5

Categories: Cobbler, Linux