Archive for the ‘Storage’ Category

Check if GRUB is installed

January 22, 2014

Odin! Guide our ships
Our axes, spears and swords
Guide us through, storms that whip
And in brutal war!!!
(Amon Amarth – The Pursuit of the Vikings)

The GRUB boot loader is installed to the MBR of a block device, or to a partition, during operating system deployment. The need for a GRUB reinstallation rarely arises. One of the most common situations is when an MD RAID is used for the boot device and one of the mirrored drives fails: the new drive comes blank, and GRUB has to be installed on it to ensure the system can still boot if the other drive fails too.

There are two methods to see whether GRUB is already installed on a drive. The first uses the file command:

# file -s /dev/sda
/dev/sda: x86 boot sector; GRand Unified Bootloader, 
stage1 version 0x3, boot drive 0x80, 1st sector stage2 0x1941f250,
GRUB version 0.94; .....

This method, however, is not entirely reliable. Since we know that GRUB stage1 occupies the first 512 bytes of a drive (the MBR), we can read that sector with dd and parse the output with strings:

# dd bs=512 count=1 if=/dev/sda 2>/dev/null| strings
ZRrI
D|f1
GRUB 
Geom
Hard Disk
Read
 Error

If the string GRUB shows up in this output, GRUB is indeed installed. Just for comparison, checking the sda1 partition on the same drive gives somewhat different results:

# dd bs=512 count=1 if=/dev/sda1 2>/dev/null| strings
NTFS    
NTFSu
TCPAu$
fSfSfU
fY[ZfYfY
A disk read error occurred
BOOTMGR is missing
BOOTMGR is compressed
Press Ctrl+Alt+Del to restart

This partition is obviously NTFS with Windows BOOTMGR. OK, I have a dual boot, so this is fine 🙂 Now, just for demonstration purposes, let's fix a real-world situation: sda from an MD RAID failed and was replaced, and this is what we found on the system:

[root@machine ~]# dd bs=512 count=1 if=/dev/sdb 2>/dev/null | strings
ZRrI
D|f1
GRUB 
Geom
Hard Disk
Read
 Error
[root@machine ~]# dd bs=512 count=1 if=/dev/sda 2>/dev/null| strings
[root@machine ~]# grub 
Probing devices to guess BIOS drives. This may take a long time.

    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.]
grub>  setup (hd0)
setup (hd0)
 Checking if "/boot/grub/stage1" exists... no
 Checking if "/grub/stage1" exists... yes
 Checking if "/grub/stage2" exists... yes
 Checking if "/grub/e2fs_stage1_5" exists... yes
 Running "embed /grub/e2fs_stage1_5 (hd0)"...  15 sectors are embedded.
succeeded
 Running "install /grub/stage1 (hd0) (hd0)1+15 p (hd0,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.
grub> quit
[root@machine ~]# dd bs=512 count=1 if=/dev/sda 2>/dev/null| strings
ZRrI
D|f1
GRUB 
Geom
Hard Disk
Read
 Error

It's obvious that GRUB was installed on sdb but not on the fresh sda, so we entered the grub shell and ran setup on (hd0). The final dd check confirms that GRUB is now present on sda as well.
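To make this check repeatable across all members of an array, a quick loop helps – a minimal sketch, assuming the drives you care about are sda and sdb:

for d in /dev/sda /dev/sdb; do \
  dd bs=512 count=1 if=$d 2>/dev/null | strings | grep -q GRUB && \
  echo "$d: GRUB found in MBR" || echo "$d: no GRUB in MBR"; \
done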

Categories: Linux, Storage

SRP tools problems

January 12, 2013

It’s like rain on your wedding day
It’s a free ride when you’ve already paid
It’s the good advice that you just didn’t take
Who would’ve thought… it figures

(Alanis Morissette – Ironic)

SRP stands for SCSI RDMA Protocol – a protocol that allows you to connect to SCSI devices attached to another computer over Remote Direct Memory Access. Remote-direct? Yeah, an oxymoron. If you wish to use RDMA, the underlying network has to support it; so far, the most common usage is on InfiniBand networks.
I had a chance to play with an SRP target / SRP client connection, and my impression is that the whole SRP field is still – let's put it this way – not tested enough. I wasn't impressed with what I saw; the software just isn't mature yet. But since I've already started, let's dive in.

My distro of choice for production environments is CentOS, so I'll talk about the implementations available there. So, if you have an InfiniBand adapter, how should you start?

# yum install rdma infiniband-diags mstflint qperf
# yum install librdmacm libmlx4 libmthca srptools opensm
# /etc/init.d/rdma start
# /etc/init.d/opensm start

As you can see, I'm using an mlx4 adapter. And now the fun starts.
I already have targets set up on port 1 of a two-port Mellanox adapter. Port 2 is not connected (even IB cables cost like hell), so I won't be using multipathing. I just want to connect to the target and use the disk. The ibsrpdm command allows you to scan the network for available targets, so let's use it:

# ibsrpdm -c
id_ext=0002c9030051060c,ioc_guid=0002c9030051060c,\
dgid=fe800000000000000002c9030051060d,pkey=ffff,service_id=0002c9030051060c

The output of the command can be fed to ‘/sys/class/infiniband_srp/srp-mlx4_0-1/add_target’, and the client will connect to the target and see all the disks exported to it there.
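For example, using the target discovered above (the srp-mlx4_0-1 entry corresponds to port 1 of the mlx4_0 adapter):

# echo "id_ext=0002c9030051060c,ioc_guid=0002c9030051060c,\
dgid=fe800000000000000002c9030051060d,pkey=ffff,service_id=0002c9030051060c" \
> /sys/class/infiniband_srp/srp-mlx4_0-1/add_target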
Now, to automate this task and perform periodic re-scans, you can use the SRP daemon. At least that's what the documentation says.

First, let me describe the way the daemon starts on CentOS.
The init script runs another bash script as a daemon, ‘/usr/sbin/srp_daemon.sh’. This script sets up signal traps and log handling, then goes through all available adapters and ports and runs yet another bash script, ‘run_srp_daemon’, for each of them. Finally, ‘run_srp_daemon’ executes the actual ‘srp_daemon’ binary.

The first problem I noticed is that ‘srp_daemon’ is started for all available interfaces, no matter whether they are up or down. On the down ports this results in these errors:

25/10/12 17:34:58 : No response to inform info registration
25/10/12 17:34:58 : Fail to register to traps, maybe there is no opensm running on fabric
25/10/12 17:34:58 : SM LID is 0, maybe no opensm is running

Next, I decided to grab and unpack the Debian Sid package to see how they did it. I found a piece of code that checks whether the interface is actually up and running before starting the daemon, so I ported it to the RHEL script. That fixed the problem, but then I hit a new issue: ‘srp_daemon’ just didn't add targets at all. If I run the following command:

for i in `/usr/sbin/ibsrpdm -c`; do \
 echo $i > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target;\
done

all the block devices are added and registered correctly, and can be seen in the ‘fdisk -l’ output. So for the first couple of days I was running this command from ‘/etc/rc.local’, but eventually I got fed up and decided to debug the thing further. Running ‘srp_daemon’ in verbose mode (-V) shows the following useful information:

enter do_port
Found an SRP target with id_ext 0002c90300510648 - check if it allowed by rules file
Found an SRP target with id_ext 0002c90300510648 - check if it is already connected
id_ext=0002c90300510648,ioc_guid=0002c90300510648,dgid=fe800000000000000002c90300510649,pkey=ffff,service_id=0002c90300510648
Adding target returned 156

Now, this "returned 156" sounds like something didn't go quite right – write() returns the number of bytes written, so I wanted to see exactly what string was being written to add_target. I fetched the source code of srptools and applied this patch:

--- srptools-0.0.4/srp_daemon/srp_daemon.c	2009-08-30 15:56:11.000000000 +0200
+++ srp_daemon.c	2013-01-12 03:11:09.000000000 +0100
@@ -183,6 +183,7 @@
 		}
 		ret = write(fd, target_str, strlen(target_str));
 		pr_debug("Adding target returned %d\n", ret);
+		pr_debug("Target string: %s\n", target_str);
 		close(fd);
 	}
 }

This let me see that srp_daemon was trying to add the target with an initiator_ext=… field appended. My solution was to remove the ‘-n’ option, which is what makes srp_daemon include the initiator extension. Together with the interface check ported from Debian, the final patch looks like this:

--- srp_daemon.sh	2013-01-12 02:17:37.000000000 +0100
+++ srp_daemon.sh_patched	2013-01-12 02:17:52.000000000 +0100
@@ -108,8 +108,14 @@
 do
     for port in `/bin/ls -1 ${ibdir}/${hca_id}/ports/`
     do
-        ${prog} -e -c -n -i ${hca_id} -p ${port} -R ${retries} ${params}&
-        pids="$pids $!"
+        STATUS=`/usr/sbin/ibstat $hca_id $port | grep "State:"`
+        if [ "$STATUS" = "State: Active" ] ; then
+            ${prog} -e -c -i ${hca_id} -p ${port} -R ${retries} ${params}&
+            pids="$pids $!"
+        fi 
     done
 done

Now the script starts ‘srp_daemon’ only if the interface is up, targets are added correctly, and the logs are finally quiet.
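You can also run the check the patched script performs by hand – this is exactly what it greps for (mlx4_0 and port 1 as an example; on a connected port ibstat reports State: Active):

# /usr/sbin/ibstat mlx4_0 1 | grep "State:"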

But the next problem came after a reboot: the daemon didn't start at all, although it was enabled in the correct runlevel. After a few more reboots and a few dozen service restarts, I noticed that the issue was the slow warmup of ‘opensm’. The Subnet Manager's init script does ‘sleep’ after the start procedure, but only for a single second; increasing that (to 20 seconds in the patch below) did the trick. Yeah, I know, that's quite a lot! But servers these days do plenty of self-checks at each startup, so a few extra seconds to let ‘srp_daemon’ start sanely isn't all that much. I've opened a bug report: https://bugzilla.redhat.com/show_bug.cgi?id=894546 , and here's my patch:

--- opensm	2013-01-12 02:40:35.000000000 +0100
+++ opensm_patched	2013-01-12 03:04:52.000000000 +0100
@@ -61,7 +61,7 @@
     else
         $prog -B $prio >/dev/null 2>&1
     fi
-    sleep 1
+    sleep 20
     OSM_PID=`pidof $prog`
     checkpid $OSM_PID
     RC=$?

I will write about the SCST target daemon in another post, so stay tuned 😉

Expanding ZFS zpool RAID

January 1, 2013

I'm a big fan of ZFS and all the volume management options it offers. ZFS often makes hard things easy and impossible things possible. In an era of ever-growing data sets, sysadmins are regularly pressed with the need to expand volumes. While this may be easy to accomplish in an enterprise environment with IBM or Hitachi storage solutions, problems arise on mid- and low-end servers. Most often, expanding a volume means an online rsync to a new data pool, then another rsync while the production system is down, and finally putting the new system into production. ZFS makes this process a breeze.

Here is one example where ZFS really shines. Take a look at the following pool:

# zfs list | grep "tank "
tank                        1.75T  31.4G  40.0K  /tank

and its geometry:

# zpool status tank
  pool: tank
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0

errors: No known data errors

The 1.75 TiB pool is slowly getting filled. As you can see, it's a 6-disk pool consisting of two RAID-Zs in a stripe – roughly RAID 50 in conventional RAID nomenclature. That's a lot of data to rsync over, isn't it? Well, ZFS offers a neat solution: we can replace a single disk with a bigger one, rebuild the RAID, and repeat the procedure six times. Finally, after the last rebuild, we can 'grow' the pool to the new size. In this particular case I decided to replace the 500 GB Seagates with 2 TB Western Digital drives.

This is how the pool looks after disk c2t5d0 has been physically replaced with a 2 TB drive:

# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scan: resilvered 449G in 7h9m with 0 errors on Mon Dec 24 20:58:51 2012
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
          raidz1-1  DEGRADED     0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  UNAVAIL      0     0     0  cannot open

errors: No known data errors

Now we need to tell ZFS to rebuild the pool:

# zpool replace tank c2t5d0 c2t5d0

After this command, the resilvering process starts. A few hours later, the state of the system is:

# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Tue Jan  1 14:43:22 2013
    91.6M scanned out of 2.63T at 3.52M/s, 217h26m to go
    14.5M resilvered, 0.00% done
config:

        NAME              STATE     READ WRITE CKSUM
        tank              DEGRADED     0     0     0
          raidz1-0        ONLINE       0     0     0
            c2t0d0        ONLINE       0     0     0
            c2t1d0        ONLINE       0     0     0
            c2t2d0        ONLINE       0     0     0
          raidz1-1        DEGRADED     0     0     0
            c2t3d0        ONLINE       0     0     0
            c2t4d0        ONLINE       0     0     0
            replacing-2   DEGRADED     0     0     0
              c2t5d0/old  FAULTED      0     0     0  corrupted data
              c2t5d0      ONLINE       0     0     0  (resilvering)

errors: No known data errors

After the process is finished, the pool looks something like this:

# zpool status tank
  pool: tank
 state: ONLINE
 scan: resilvered 449G in 7h1m with 0 errors on Tue Jan  1 21:44:37 2013
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0

errors: No known data errors

Once all the disks are replaced, the only thing needed to grow the pool is to set the autoexpand property to on. If it was already on, first turn it off and then back on to trigger the expansion:

# zfs list | grep "tank "
tank                        1.75T  31.4G  40.0K  /tank
# zpool set autoexpand=off tank
# zpool set autoexpand=on  tank
# zfs list | grep "tank "
tank                        1.75T  5.40T  40.0K  /tank

And that's it! We've grown a striped 2x RAID-Z configuration from 500 GB drives to 2 TB drives, taking the total pool size from roughly 1.8 TiB to about 7.15 TiB (used plus available in the listings above). Enjoy the wonders of ZFS!
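To recap the whole procedure in one place, here's a minimal sketch using the device names from the pool above; it waits for each resilver to finish before moving on to the next disk:

for disk in c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0; do
    echo "Swap $disk for the bigger drive, then press Enter"; read dummy
    zpool replace tank $disk $disk
    # block here until the resilver completes
    while zpool status tank | grep -q "resilver in progress"; do sleep 60; done
done
zpool set autoexpand=off tank   # toggle the property off and back on
zpool set autoexpand=on  tank   # to trigger the expansion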

Categories: Solaris, Storage

Finding HDD serial number in Solaris

October 30, 2012

Redeemers of this world
Dwell in hypocrisy:
“How were we supposed to know?”
(Nightwish – The Kinslayer)

ZFS is one of those technologies out there that really kicks some serious ass. Its data security and storage scalability are unmatched by any other volume manager + filesystem combination. But, being mechanical beasts, hard disks tend to fail sooner or later. Today I got an alert from one of my systems, and this was the state I encountered:

# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
          raidz1-1  DEGRADED     0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  DEGRADED     0     0    33  too many errors

errors: No known data errors

No known data errors, despite bad blocks on one of the hard drives in a RAID-Z – now how cool is that! Silent corruption isn't even a possibility 🙂 OK, it's time to replace the hard drive, but how do we locate it in the chassis? Even if you know the exact slot position, the serial number is always a welcome additional safety measure: we don't want to replace the wrong drive, do we? So, how can one see the serial number of a hard drive on Solaris? First try, iostat:

# iostat -E c2t5d0
 sd5       Soft Errors: 0 Hard Errors: 184 Transport Errors: 0
 Vendor: ATA      Product: ST3500630AS      Revision: C    Serial No:
 Size: 500.11GB <500107861504 bytes>
 Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
 Illegal Request: 24 Predictive Failure Analysis: 0

Now we know the HDD model and some additional information, but the Serial No field is empty, so there is no way to distinguish this drive from the other five in the pool. The iostat command may work on SPARC systems, but on a home-built cheap storage server with SATA disks it obviously fails to deliver the needed information. Next try, cfgadm:

# cfgadm -alv | grep SN | grep c2t5d0
 sata0/5::dsk/c2t5d0            connected    configured   ok        Mod: ST3500630AS FRev: 3.AAC SN: 4BA6G5NN

OK, now we have the serial number and can be 100% certain which drive to replace.
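And if you need to label the whole chassis at once, dropping the device filter lists the serial number of every connected disk:

# cfgadm -alv | grep "SN:"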

Categories: Storage

Low cost JBODs…

July 6, 2009

I work all night, I work all day, to pay the bills I have to pay
Ain't it sad
And still there never seems to be a single penny left for me
That's too bad
(ABBA – Money, Money, Money)

Today I intended to enjoy a ZFS + AVS installation on two SuperMicro JBODs. I haven't had Thumpers in my hands yet, but this was supposed to be a training session. Supposed.

On the scene, I learned that the JBOD is divided into two halves: the left half (8 disks) connected to a controller in slot 3, and the right half (8 disks) connected to a controller in slot 6. The bad part is that you cannot create arrays spanning the two halves. So if you create, say, a RAID 6 on the left side and another on the right side, with the idea of striping data across the two arrays – alarm! A bad idea: if one controller dies on you, you lose everything. So what to do? Well, in the best practice of ZFS lovers, I created 8 single-disk arrays on each side 🙂

I know what you're thinking… You're thinking this guy's mad! Well, I'm not. The general idea is to create a ZFS RAID 10 by mirroring the disk on port 1 of the left controller with the disk on port 1 of the right controller, the disk on port 2 of the left controller with the disk on port 2 of the right controller, and so on, then striping across the mirrors. This way I'm reasonably safe even if one controller completely breaks down.
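In zpool terms, the layout would look roughly like this – a sketch with hypothetical cXtYd0 device names, c3 being the left controller in slot 3 and c6 the right one in slot 6:

# zpool create tank \
    mirror c3t0d0 c6t0d0 \
    mirror c3t1d0 c6t1d0 \
    mirror c3t2d0 c6t2d0 \
    mirror c3t3d0 c6t3d0 \
    mirror c3t4d0 c6t4d0 \
    mirror c3t5d0 c6t5d0 \
    mirror c3t6d0 c6t6d0 \
    mirror c3t7d0 c6t7d0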

OK, on to the installation. Well, not quite. Solaris 10u7 does not support 3ware controllers, and that's exactly the controller I had to deal with 🙁 Google says that Solaris Express and OpenSolaris support this particular controller. We'll see about that next time.

Categories: Storage