Thomas Zeiser

Some comments by Thomas Zeiser about HPC@RRZE and other things


Why cfx5solve from Ansys-13.0 fails on SuSE SLES11SP2 …

Recently, the operating system of one of RRZE’s HPC clusters was upgraded from SuSE SLES10 SP4 to SuSE SLES11 SP2 … One of the few things that broke due to the OS upgrade is Ansys/CFX-13.0: cfx5solve now aborts with

ccl2flow: * command language error *
Message: getChildList: unable to find the requested path
Context: returned by cclApi call

As one can expect, Ansys does not support running Ansys-13.0 on SuSE SLES11 or SLES11 SP2. There are also lots of reports on this error for different unsupported OS versions in the CFX forum at cfd-online but no explanations or workarounds yet.

So, where does the problem come from? A long story starts …

First guess: SuSE SLES11 SP2 runs a 3.0 kernel. Thus, there might be some script which does not correctly parse the output of uname. However, the problem persists if cfx5solve is run using uname26 (or the equivalent long setarch variant). On the other hand, the problem does not occur if e.g. a CentOS-5 chroot is started on the SLES11 SP2 kernel, i.e. still the same kernel but old user space. This clearly indicates that it is not a kernel issue but a library or tool problem.

Next guess: Perl comes bundled with Ansys/CFX but it might be some other command line tool from the Linux distribution which is used by cfx5solve, e.g. sed and friends or some changed bash behavior. Using strace on cfx5solve reveals several calls of such tools. But actually, none of them is problematic.

Thus, it must be a library issue: Ansys/CFX comes with most of the libraries it needs bundled, but the glibc is always taken from the system, i.e. /lib64/ld-linux-x86-64.so.2, /lib64/libc.so.6, etc. SuSE SLES10 used glibc-2.4, RHEL5 uses glibc-2.5, but SLES11 SP2 uses glibc-2.11.3.
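To see which glibc a given node actually runs, a few lines of Python are enough. This is only a minimal sketch: it assumes a Python interpreter is available on the node, and the helper names glibc_version and version_tuple are made up for illustration. Note that version strings have to be compared numerically, since "2.5" sorts after "2.11.3" as plain text.

```python
import os
import platform


def glibc_version():
    """Return the glibc version of the running process as a string, or None."""
    try:
        # On glibc systems this yields a string such as "glibc 2.11.3".
        value = os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        value = None
    if value and value.startswith("glibc "):
        return value.split(" ", 1)[1]
    # Fallback: let the platform module inspect the interpreter binary.
    name, version = platform.libc_ver()
    return version if name == "glibc" and version else None


def version_tuple(version):
    """Turn "2.11.3" into (2, 11, 3) so that versions compare numerically."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())
```

With these helpers, version_tuple("2.5") < version_tuple("2.11.3") holds as expected, and glibc_version() returns None on systems that do not use glibc.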

The glibc cannot be overridden using LD_LIBRARY_PATH like any other library. But there are ways to do it anyway …

The error message suggests that ccl2flow.exe is causing the problems. So, let’s run that with an old glibc version. As cfx5solve allows specifying a custom ccl2flow binary we can use a simple shell script to call the actual ccl2flow.exe using the loader and glibc libraries from the CentOS5 glibc-2.5. Nothing changes; still the very same getChildList error message in the out file. Does that mean that ccl2flow.exe is not the bad guy?
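One way to run a binary against a different glibc is to start it through the dynamic loader of that glibc and hand the loader its own library directory via --library-path. The following is a rough sketch in Python rather than the shell script actually used; the directory /opt/glibc-2.5 and the path to ccl2flow.exe are placeholders, not the real locations. Since --library-path replaces LD_LIBRARY_PATH, the directories of the bundled CFX libraries have to be appended explicitly.

```python
import os


def build_loader_command(loader, library_dirs, binary, args):
    """Build an argv that starts `binary` through an explicit dynamic loader."""
    return [loader, "--library-path", ":".join(library_dirs), binary, *args]


def exec_with_old_glibc(argv):
    """Replace the current process by ccl2flow.exe running on the old glibc."""
    # Placeholder locations (assumptions): an unpacked CentOS 5 glibc-2.5 tree
    # and the real CFX binary.
    old_root = "/opt/glibc-2.5"
    real_binary = "/path/to/ccl2flow.exe"
    # Keep whatever the CFX start scripts already put into LD_LIBRARY_PATH.
    inherited = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    command = build_loader_command(
        loader=os.path.join(old_root, "lib64", "ld-linux-x86-64.so.2"),
        library_dirs=[os.path.join(old_root, "lib64")] + inherited,
        binary=real_binary,
        args=argv,
    )
    os.execv(command[0], command)  # does not return on success
```

A wrapper script would simply call exec_with_old_glibc(sys.argv[1:]) and be handed to cfx5solve as the custom ccl2flow binary.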

Interlude: Let’s see how ccl2flow.exe is called. The shell wrapper for ccl2flow was already there; thus, let’s add some echo statements to print the command line arguments and a sleep statement to gain time to inspect the working directory. Et voilà. On a good system, a quite long ccl file has just been created before ccl2flow is called; however, on a bad system running the new OS the ccl file is almost empty. Thus, we should not blame ccl2flow.exe but what happens before. Well, before there is just the Ansys-supplied Perl running.
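The same kind of instrumentation can be sketched in Python instead of echo and sleep in the shell wrapper. The function names and the 60 second pause are arbitrary choices for illustration; the idea is just to log the arguments and the file sizes in the working directory, wait a bit, and then hand over to the real binary.

```python
import os
import sys
import time


def snapshot(argv, directory):
    """Return printable lines describing the call and the files in directory."""
    lines = ["argv: " + " ".join(argv)]
    for name in sorted(os.listdir(directory)):
        # lstat does not follow symlinks, so dangling links do not raise.
        size = os.lstat(os.path.join(directory, name)).st_size
        lines.append(f"{size:>10d}  {name}")
    return lines


def main(real_binary, pause_seconds=60):
    """Log the invocation, pause for manual inspection, then run the real tool."""
    for line in snapshot(sys.argv[1:], os.getcwd()):
        print(line, file=sys.stderr)
    time.sleep(pause_seconds)  # time window to look around in the run directory
    os.execv(real_binary, [real_binary, *sys.argv[1:]])
```

An almost empty ccl file shows up immediately in such a listing because of its tiny size.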

Let’s have a closer look at the Perl script: Understanding what the cfx5solve Perl script does seems to be impossible. Even if the Perl script is traced on a good and a bad system, there are no real insights. At some point, the bad system does not return an object while the good one does. Thus, let’s run Perl using the old glibc version. That’s a little bit more tricky as cfx5solve is not a binary but a shell script which calls another shell script before finally calling an Ansys-supplied Perl binary. But one can also manage these additional difficulties. Et voilà, the error message disappeared. What’s going on? Perl is running fine but producing different results depending on the glibc version.

Interlude Ansys/CFX-14.0: This version is officially only supported on SuSE SLES11 but not SLES11 SP2, if I got it correctly. But it runs fine on SLES11 SP2, too. What Perl version do they use? Exactly the same version, even the very same binary (i.e. both binaries have the same checksum!). Thus, it is not Perl itself but some CFX-specific library it dynamically loads.
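Comparing two binaries by checksum is what md5sum or sha256sum do on the command line; as a small sketch, the same check in Python looks like this. The function names are illustrative, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so that large binaries need not fit into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(path_a, path_b):
    """True if both files have identical content (up to hash collisions)."""
    return file_digest(path_a) == file_digest(path_b)
```

Calling same_binary with the paths of the two bundled perl executables answers the question in one line.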

End of the story? Not yet, but almost. Having already spent so much time on the problem, I finally wanted to know which glibc versions are good and which are evil. I already knew that Red Hat’s glibc-2.5 is good and SuSE’s glibc-2.11.3 is evil. Thus, let’s try the versions in between using the official sources from ftp.gnu.org/gnu/glibc. Versions <2.10 or so require a fix for the configure script to recognize a modern as or ld as a good version. A few versions do not compile properly at all on my system. But there is no bad version; even with the vanilla 2.11.3 there is no CFX error. Only from glibc-2.12.1 onwards does the well-known ccl2flow error appear. Not really surprising: SuSE and other Linux distributors apply long lists of patches, including back-ports from newer releases. There are almost 100 SuSE patches included in their version of glibc-2.11.3-17.39.1; no chance to see what they are doing.

My next guess is that the problem must be a commit between 2.11.3 and 2.12.1 of the official glibc versions. GNU provides a Git repository and git bisect is your friend. This leads to commit f89d2f30 from Dec. 2009: Enable multiarch whenever possible. This commit did not change any actual code but only the default configuration parameters. That means the code causing the fault must have been in the sources much earlier. It only became active once multi-arch was switched on by default in 2.12.1 of the vanilla version, or earlier in the SuSE version (the spec file contains an --enable-multi-arch line; verified).
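For those who have not used it: the idea behind git bisect is a plain binary search over the history. The sketch below is a simplification that assumes a linear list of commits with exactly one transition from good to bad; the real tool also copes with branches and merges. The function is_bad stands for "build this glibc and check whether cfx5solve fails", which was the expensive part in practice.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit.

    `commits` is ordered from oldest to newest; commits[0] is known to be
    good and commits[-1] is known to be bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(commits[middle]):
            bad = middle   # the transition is at or before the middle
        else:
            good = middle  # the transition is after the middle
    return commits[bad]
```

The number of builds grows only logarithmically with the number of commits, so even a few hundred commits between two releases need fewer than ten test builds.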

Going back in history, it finally turns out that glibc commit ab6a873f from Jun 2009 (SSSE3 strcpy/stpcpy for x86-64) is responsible for the problems leading to the failing ccl2flow.
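Whether a given node is a candidate for the SSSE3 code path at all can be checked from the CPU flags. This sketch assumes that the multi-arch dispatcher only selects the SSSE3 variants on CPUs which advertise the ssse3 flag, and that /proc/cpuinfo is available (Linux only); the function names are made up.

```python
def cpu_flags(cpuinfo_text):
    """Extract the set of CPU feature flags from /proc/cpuinfo content."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the flags of the first CPU; empty set if the file is unavailable."""
    try:
        with open(path) as handle:
            return cpu_flags(handle.read())
    except OSError:
        return set()
```

Mind the spelling: ssse3 (three s) is the flag in question; SSE3 is a different extension and is listed as pni in /proc/cpuinfo.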

Unfortunately, it is not possible to see if the most recent glibc versions still cause problems as cfx5solve already aborts earlier with some error message (Can’t call method “numstring” on an undefined value).

It is also not clear whether it is a glibc error, a problem in one of the CFX libraries, or just a consequence of the tools used when Ansys-13.0 was compiled.

End of the story: If you are willing to take the risk of getting wrong results, you may make v130/CFX/tools/perl-5.8.0-1/bin/Linux-x86_64/perl use an older glibc version (or one compiled without multi-arch support) and thus avoid the ccl2flow error. But who knows what else fails, visibly or behind the scenes. There is an unknown risk of wrong results even if cfx5solve now runs in principle on SuSE SLES11 SP2.

I fully understand that users do not want to switch versions within a running project. Thus, it is really a pity that ISVs force users (and sys admins) to run very old OS versions. SuSE SLES 10 was released in 2006 and will reach end of general support in July 2013; SLES11 was released in March 2009 while Ansys-13.0 was released only in autumn 2010. And we are still supposed to stick with SLES10? It’s time to increase the pressure on ISVs or to start developing in-house codes again.