Unix Notebook

Saturday, July 16, 2005

Redhat Cluster 4 how-to

Redhat Cluster 4: Steps for setting up a 2 node cluster

* This document assumes that you’ve read the PDF in the documentation section of Redhat’s site. It condenses that material for anyone who wants to build a 2 node cluster.
** Make sure you read all the tips at the very bottom, or you could be in for some pain.

A) What we want
1. Two servers in an active/passive cluster (one fails, the other takes over)
2. A shared storage area between them (disk array, luns from a SAN, etc.)
3. Floating IP
4. Bonded Ethernet interfaces (two interfaces on the same system acting as one highly available interface).
5. Power fencing (if system A appears unreachable, system B turns system A off and takes over the cluster).

B) Setup interface bonding (fairly easy)
1. I documented this at: http://unixnotebook.blogspot.com
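
For quick reference, here is a rough sketch of what a bonding setup on a RHEL 4 era box usually looks like. This is not copied from the linked post. The interface names (eth0, eth1), the active-backup mode (mode=1), the 100 ms link check (miimon=100), and the staging directory are all my assumptions, and write_bonding_config is a made-up helper name. It only writes files into a directory you pick, so you can look them over before copying anything into /etc.

```shell
#!/bin/sh
# Sketch only: writes example bonding config files into a staging directory.
# Nothing under /etc is touched. All values below are assumptions.

write_bonding_config() {
    dest="$1"      # staging directory (hypothetical), review before copying
    ipaddr="$2"    # address for the bonded interface
    netmask="$3"
    mkdir -p "$dest" || return 1

    # Lines to append to /etc/modprobe.conf: load the bonding driver as bond0.
    # mode=1 is active-backup; miimon=100 polls link state every 100 ms.
    cat > "$dest/modprobe.conf.fragment" <<EOF
alias bond0 bonding
options bond0 mode=1 miimon=100
EOF

    # Would become /etc/sysconfig/network-scripts/ifcfg-bond0
    cat > "$dest/ifcfg-bond0" <<EOF
DEVICE=bond0
IPADDR=$ipaddr
NETMASK=$netmask
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
EOF

    # Would become ifcfg-eth0 and ifcfg-eth1: both enslaved to bond0.
    for nic in eth0 eth1; do
        cat > "$dest/ifcfg-$nic" <<EOF
DEVICE=$nic
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
EOF
    done
}
```

Something like "write_bonding_config /tmp/bond-staging 10.0.0.10 255.255.255.0" (example values) gives you four files to review. After copying them into place and restarting the network, /proc/net/bonding/bond0 should show which slave is active.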

C) Startup gui
1. service ccsd start
2. service cman start
3. export DISPLAY=xxx.xxx.xxx.xxx:0
4. system-config-cluster &

D) Configure your cluster (in all steps, “close” saves your progress in the setup screens; “file->save” saves your cluster)
1. When you first start you’ll be asked to create a new cluster – do so.
2. I chose DLM (distributed lock manager) because GULM requires 3 or 5 servers in your cluster.
3. Name your cluster something nice.
4. Create a node by clicking on “cluster nodes” and then clicking “add nodes”.
Choose a quorum vote of 1.
5. Create a fence device. I chose HP’s ILO option. This required me to enter the hostname of the actual ILO as the hostname, not the hostname of the node in the cluster. I “named” my fences arbitrarily.
6. Assign those fences to the nodes by clicking on the nodes created in step 4 and clicking “manage fencing for this node”. When you do, a window pops up and you’ll click “add a new fence level”. Then you’ll click on that level (probably level 1 if you’ve just started) and click “add a fence to this level”. Then you assign the fence created in step 5 to the appropriate node.
7. Click “failover domains” and create a new failover domain. Use any name you want. I recommend unordered priority and unrestricted – let any node in the cluster run the cluster. Fool with it later if you have time.
8. Resources: resources are things like “IP addresses” or “shared filesystems” or “scripts”. Generally, your “scripts” will be the ones in /etc/init.d (httpd, for example). This part of the setup is straightforward. I’m pretty sure I had to put my IPs in /etc/hosts, but I’m not sure if that’s what made it work or not. BE VERY AWARE: “ifconfig -a” may or may not show eth0:1 or the like. I’m not sure how redhat does it or if it is a bug, but both of my 2.6.9-11 ELsmp kernels brought up the IPs but didn’t bring up an alias interface. It may be worth checking “ip addr show” as well, since addresses added with the ip tool don’t always appear in ifconfig output. Also, this was clearly a bug: I couldn’t create one particular IP address for the life of me (10.x.x.25). I then tried doing .98 and it worked fine. Something got “hosed” up in the plumbing probably.
9. Create services: create an arbitrary name for your service (whatever you like), and add resources to it. Order matters – IPs and filesystems first and scripts last, because the scripts need the other two. Also, you can nest your stuff. It appears that the topmost layer is the base layer, and the lower layers are the things that rely on the base layer. Either way, it’s a little buggy in my opinion so I didn’t layer anything. It seems to work just laying them all down one after the other, from most basic to most complex. When you are ready, assign this service to a failover domain.
10. Save your configuration (goes to /etc/cluster/cluster.conf)
11. Bring the other node into the cluster:
i. # service ccsd start (on the other node)
ii. # service cman start
iii. On the original system, bring up that gui again. This time you’ll see a management console tab, and a button in the top right corner which reads “Send to cluster”. The button saves the config to /etc/cluster/cluster.conf. If there’s already one there (and there should be), it moves it to a backup file first and then saves. The last thing the button does is send the config over to the other system. If you need to, ftp will also do the trick, but you shouldn’t need to.
iv. Exit the gui.
12. Start the cluster
i. On both systems, 1 node first and then the other, run the other two daemons:
ii. # service fenced start
iii. # service rgmanager start
13. Check the cluster
i. # clustat (you should see ‘stuff’)
14. If the cluster isn’t started
i. Go into the gui, go to the management tab and click on the service, then “enable” it. If it is in a failed state and won’t start, take down everything with all the “service x start/stop” commands and bring everything up. If it still doesn’t work, do some basic unix troubleshooting (permissions, groups, paths to resources/scripts, does it really mount like you think it will, is there an ip conflict, etc.) If that doesn’t work you’re in for the long haul…
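
Since “take everything down and bring everything up” comes up a lot while debugging, here is a small sketch of a wrapper for it. The start order (ccsd, cman, fenced, rgmanager) is the one used in the steps above. Stopping in the reverse order is my assumption, not something the steps spell out. The function names and the DRY_RUN switch are made up for this example.

```shell
#!/bin/sh
# Sketch: start or stop the four cluster daemons in a fixed order.
# Start order follows the how-to; reverse order for stop is an assumption.
CLUSTER_DAEMONS="ccsd cman fenced rgmanager"

# Print the daemon list in reverse order, for shutdown.
reverse_daemons() {
    rev=""
    for d in $CLUSTER_DAEMONS; do
        rev="$d${rev:+ $rev}"
    done
    echo "$rev"
}

# Run "service <daemon> <action>" for each daemon in turn.
# With DRY_RUN=1 the commands are only printed, not executed.
cluster_ctl() {
    action="$1"
    case "$action" in
        start) list="$CLUSTER_DAEMONS" ;;
        stop)  list="$(reverse_daemons)" ;;
        *)     echo "usage: cluster_ctl start|stop" >&2; return 2 ;;
    esac
    for d in $list; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "service $d $action"
        else
            # Stop at the first failure so the problem daemon is obvious.
            service "$d" "$action" || return 1
        fi
    done
}
```

Run it with DRY_RUN=1 first to see what it would do. As in step 12, do one node and then the other rather than both at once, and check clustat afterwards.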

Stuff that took me forever to figure out:

1) bonding was easy on one server that had a very up2date kernel. The one that was slightly behind in its upkeep had problems – bonding came up but we had a ton of kernel errors that I didn’t have time to really figure out, so I just upgraded the kernel – it worked.

2) As I mentioned above, Redhat Cluster didn’t (maybe doesn’t) make a virtual interface of the “ethx:1” variety. So don’t waste hours looking for it.

3) HEAR ME HEAR ME: Every start/stop script in Redhat Cluster requires a “status” option (e.g., /etc/init.d/mysql.server status). If you don’t have one, redhat cluster will keep bouncing your service. You’ll have to put a status check in there that returns 0 (zero) when the service is running.

4) Make sure 127.0.0.1 is only named “loopback” and not your server name. Map your server name to a real, reachable IP instead. You should also put any other names you can think of in your hosts file (like your fences).

5) HEAR ME HEAR ME: if you enable fencing and you are having problems with its stop/start procedures hanging, don’t reboot. Your system will hang as it’s coming up, forcing you to bring it up manually one process at a time, hitting No at that process.

6) The default log location for Redhat Cluster is /var/log/messages. Run “tail -f” on that file and grep for “clu” and you’ll see all the cluster-related messages.

7) If you ever get complaints from the gui at startup about XML syntax errors, well, it means that you screwed something up. I know, I can’t believe it either – the gui allows you to make impossible entries into your XML file. No matter how much you think you are doing it right, trust me, you messed up – and the gui let you.

8) As you struggle to get things going, ALWAYS check your services using “ps” and ALWAYS check that the filesystems you expect to be mounted or unmounted really are. Until you get everything right, you have to babysit your system – you literally could get the same filesystem mounted on two boxes, with both boxes trying to start the services. It’s disgusting.

9) If you’re like me and you don’t have a CDROM connected to your proliant blade server and only have an iso, and you mount that for your install, you’re going to be asked to insert one of your Redhat Linux install disks (or you might). This is nuts. It auto-ejects your iso, and suddenly you need to put a disk in – which it expects to auto-mount for you when you click “ok”. The workaround is to copy everything from the iso to disk, and delete/move all rpms from the rpms directory that don’t pertain to your specific type of kernel (smp, bigmem, and the like). Then install all the rpms with a * as an argument – it worked for me.

10) HEAR ME HEAR ME HEAR ME: If you simulate a network failure to make the cluster fail over (e.g., ifdown eth0) and all you see is "CMANsendmesg failed: -101", then here's the problem: your power-fencing system is sending "poweroff" to the server, but the "acpid" service is interpreting it as "shutdown -h", which won't let the server go down unless it's done gracefully. You need to go into /etc/acpi/events and change the config file, then restart the daemon (/etc/init.d/acpid stop/start). The config file might be named "sample.conf" or something, that's fine - it'll use that (man acpid).
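
To make tip 3 concrete, here is a minimal sketch of a “status” check you could drop into an init script that lacks one. It assumes the service writes its pid to a pid file. The path /var/run/myservice.pid and the function name service_status are placeholders, not anything Redhat Cluster mandates. Returning 3 for “not running” follows the usual LSB init script convention; the cluster only needs 0 for running and non-zero otherwise.

```shell
#!/bin/sh
# Sketch: pid-file based status check for an init script.
# PIDFILE is a placeholder path; point it at your service's real pid file.
PIDFILE="${PIDFILE:-/var/run/myservice.pid}"

# Return 0 if the process named in $PIDFILE is alive, 3 otherwise.
service_status() {
    [ -r "$PIDFILE" ] || return 3
    pid=$(cat "$PIDFILE")
    [ -n "$pid" ] || return 3
    # kill -0 sends no signal; it only tests whether the pid can be signalled.
    # Run as root (as init scripts are), otherwise another user's live
    # process can look dead here.
    if kill -0 "$pid" 2>/dev/null; then
        echo "running (pid $pid)"
        return 0
    fi
    echo "not running"
    return 3
}
```

In the init script itself this would be wired into the usual case statement, with a “status)” branch that calls service_status and exits with its return code.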

Comments:

  • This article was very helpful for my lab on Redhat Cluster. My trouble was that the httpd service couldn't start up; I think it's because the floating IP wasn't available before it. There is a tip in your article that when adding resources to the service, the top one is the base layer. Although this was very helpful, the service still could not start up successfully. I solved it by adding the IP resource first, and then attaching (not adding) the other service. It works.

    By Blogger showrun, at 11:44 PM  

  • The errors reported by the GUI on RHEL4 are because the XML validator has a bug. This is a known issue:
    http://sources.redhat.com/cluster/faq.html#gui_validityerror

    Anyhow, it was nice reading your post. I use it, for tests, using VMware (which is easier to manage).
    Check out my blog about it, at
    http://www.tournament.org.il/run

    Cheers!
    Ez

    By Blogger Ez, at 12:59 PM  

  • Hello ..
    I just came across your article and it is very helpful, but I ran into some problems configuring HP ILO as a fencing device. I am using HP BL460c blades. Could you please guide me on how to configure it?

    Thanks
    Nasmel

    By Blogger nasmel, at 1:32 AM  
