I’m spoiled. While we build a fair number of environments each year, we also have basic starting standards. Because of this, I sometimes miss the basics when a problem shows up. Or at least it takes me longer to get there.
In this case, we had a couple of alerts over the high-volume weekend (Black Friday 2013). They were alerts from our connection monitor. We had done some tuning before the holiday, which included increasing MAXFILOP. This database uses largely SMS tablespaces and is on an older version of DB2 (and WCS 6). The alerts were transient: as soon as someone logged in to look at them, connections were working just fine. Looking in db2diag.log on Monday morning, I saw a number of entries like this:
2013-12-02-09.49.35.996713-300 I634382E367        LEVEL: Severe
PID     : 10811                TID  : 47251525193888 PROC : db2agent (ESB19Q02) 0
INSTANCE: db2inst1             NODE : 000           DB   : ESB19Q02
APPHDL  : 0-1206               APPID: *LOCAL.db2inst1.133012144937
FUNCTION: DB2 UDB, base sys utilities, sqleserl, probe:10
RETCODE : ZRC=0xFFFFEC73=-5005

2013-12-02-09.49.35.996180-300 I632343E481        LEVEL: Error
PID     : 10811                TID  : 47251525193888 PROC : db2agent (ESB19Q02) 0
INSTANCE: db2inst1             NODE : 000           DB   : ESB19Q02
APPHDL  : 0-1206               APPID: *LOCAL.db2inst1.133012144937
FUNCTION: DB2 UDB, config/install, sqlf_read_db_and_verify, probe:30
MESSAGE : SQL5005: sqlf_openfile rc =
DATA #1 : Hexdump, 4 bytes
0x00007FFFFEC5864C : 0600 0F85
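One quick way to see how often this was happening is a simple grep against the diagnostic log. This sketch assumes the default diagnostic path under the instance owner's home directory; adjust if DIAGPATH has been changed:

$ # count occurrences of the SQL5005 error in the diagnostic log
$ grep -c "SQL5005" ~/sqllib/db2dump/db2diag.log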
One time, I actually managed to catch the error at the command line – it looked like this:
$ db2 connect to esb19q02
SQL5005C System Error.
In researching this, I found this helpful technote: http://www-01.ibm.com/support/docview.wss?uid=swg21403936
And while I first thought I needed to increase MAXFILOP further, I figured out that the real problem was the ulimit on open files:
$ ulimit -a
...
open files                      (-n) 1024
...
This particular instance had three databases on it, all with SMS tablespaces, and one with over a thousand tables. The MAXFILOP settings for the three databases added up to 4096, well over the open-files limit of 1024.
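To see how those numbers compare, MAXFILOP can be pulled from each database configuration and summed by hand. A quick sketch, where DBNAME2 and DBNAME3 are placeholders for the other two databases on the instance:

$ # pull MAXFILOP from each database configuration on this instance
$ for db in ESB19Q02 DBNAME2 DBNAME3; do
>   db2 get db cfg for $db | grep -i maxfilop
> done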
In order to increase the limit, I added the following lines to /etc/security/limits.conf, as root:
db2inst1 soft nofile 16192
db2inst1 hard nofile 16192
… where db2inst1 is my instance owner.
Modifying the ulimit directly as the instance owner did not work:
$ ulimit -n 16192
-bash: ulimit: open files: cannot modify limit: Operation not permitted
Unfortunately, these settings do not take effect until the next time the database manager is started (db2stop/db2start), so I had to schedule that outage. I could also have done it with a failover to avoid an actual outage.
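The recycle itself is simple. A sketch of the steps, run as the instance owner from a fresh login session so the new limits.conf values are picked up:

$ db2stop
$ db2start
$ # confirm the new limit is in effect for the instance owner's session
$ ulimit -n
16192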
Alternatively, MAXFILOP could be lowered across the databases so that the totals stay under the ulimit, with the side effect of possibly decreasing database performance, but preventing an outright inability to connect.
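If lowering MAXFILOP were the chosen route, the change would look something like this. The value of 1024 is purely illustrative; it would need to be sized so the totals across all databases stay under the open-files ulimit:

$ db2 update db cfg for ESB19Q02 using MAXFILOP 1024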
With automatic storage tablespaces now so easy to use, and the default, I see fewer and fewer databases making extensive use of SMS tablespaces.