How do I troubleshoot Lightweight Grid client issues?
Background Information
It is assumed that:
- gridMathematica Server has been installed on a compute node.
- The Mathematica component has been activated.
- The Lightweight Grid Manager component is running and configured with default settings.
The default configuration may not be sufficient for some setups, and issues may arise when launching gridMathematica kernels set up via the Lightweight Grid Client. For operating system-specific setup instructions, please refer to:
Common Warning Messages
Most issues can be overcome by a thorough reading of the Lightweight Grid Manager documentation.
The Server Is Not Visible in the Lightweight Grid Client
gridMathematica Server advertises itself on the network using multicast DNS (mDNS) service discovery, which allows other computers on the local network to find it. Mathematica has this discovery built in: when you enable the Lightweight Grid tab in the parallel preferences, it automatically finds any Lightweight Grid servers on the local subnet.
If Mathematica is not running on the same network as the computers running the Lightweight Grid Manager, you must enter the name of one computer that runs a Lightweight Grid Manager. All the other computers that this computer knows about then become available to Mathematica, because each gridMathematica Server keeps track of any other servers that it finds through service discovery.
If the server is still not visible after you enter its name, check that the software firewall on the server allows other hosts to connect to TCP port 3737. Also check that the firewall allows traffic in both directions on UDP port 5353 (mDNS), which the service discovery needs in order to function.
There may also be a network host-naming issue. In that case, change the ContactURL server configuration parameter in the Lightweight Grid Manager, for example to use the IP address of the server.
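The port 3737 check described above can also be run from the client machine without a browser. The following is a minimal Python sketch, not part of gridMathematica; the function name can_connect and the host name in the usage comment are placeholders chosen for illustration. It only tests whether a TCP connection can be opened, so it does not cover the mDNS traffic on port 5353.

```python
import socket


def can_connect(host: str, port: int = 3737, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    A False result covers every failure mode: the name does not resolve,
    the connection is refused, or a firewall silently drops the packets
    and the attempt times out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage ("wolframServer" is a placeholder host name):
#     can_connect("wolframServer")        # try the server by name
#     can_connect("192.168.70.1", 3737)   # try the server by IP address
```

If the check succeeds with the IP address but fails with the host name, that points to the host-naming issue rather than to the firewall.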
A Kernel Could Not Be Started (LightweightGridClient`RemoteKernelOpen::launchfailed)
The following assumes that the Lightweight Grid Client has been configured to request kernels from a server called “wolframServer”. When trying to connect to these kernels, the following warning appears:
LightweightGridClient`RemoteKernelOpen::launchfailed: Kernel could not be started on wolframServer
This suggests that gridMathematica Server is running, but that a licensed kernel was not available.
Check that the Lightweight Grid Manager can find the license information (the mathpass file). The recommended location for the mathpass file is $BaseDirectory/Licensing/mathpass. Also check that the license in the mathpass file has not expired.
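As a quick way to confirm that the file is where the manager expects it, the sketch below checks the recommended location. It is an illustrative Python helper, not a Wolfram tool: the name mathpass_status is hypothetical, and base_directory must be supplied by you (it is the value that $BaseDirectory evaluates to in a kernel on the server). The sketch does not parse the license, so the expiry date still has to be checked separately.

```python
from pathlib import Path


def mathpass_status(base_directory: str) -> str:
    """Report whether <base_directory>/Licensing/mathpass is usable.

    Returns one of "missing", "unreadable", "empty", or "present".
    The license contents are not interpreted or validated.
    """
    path = Path(base_directory) / "Licensing" / "mathpass"
    if not path.is_file():
        return "missing"
    try:
        content = path.read_text(errors="replace")
    except OSError:
        # Typically a permissions problem for the current user.
        return "unreadable"
    if not content.strip():
        return "empty"
    return "present"
```

Run the check as the same user account that runs the Lightweight Grid Manager, since a file that is readable by one account may not be readable by another.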
Cannot Connect to the Server (LightweightGridClient`RemoteKernelOpen::lwgconnect)
The following assumes that the Lightweight Grid Client has been configured to request kernels from a server called “wolframServer”. When trying to connect to these kernels, the following warning appears:
LightweightGridClient`RemoteKernelOpen::lwgconnect: Unable to connect to http://wolframServer.example.com:3737/WolframRemoteServices/Manager. Check network connectivity and the spelling of the hostname or URL of the remote computer. Confirm that a Lightweight Grid Manager is running on the remote computer.
This suggests that the published name of the machine, wolframServer, is not visible across your network, or that the Lightweight Grid Manager is not running. You can confirm this by opening the URL shown in the error message (http://wolframServer.example.com:3737/WolframRemoteServices/Manager in this example) in a web browser on the machine that runs the master Mathematica. If the browser cannot connect, there is either a network host-naming issue or the Lightweight Grid Manager is not running.
Check that the Lightweight Grid Manager is running on the server. If gridMathematica Server is running, either fix your network so that machines can be found by name, or change the ContactURL server configuration parameter, for example to use the IP address of the server.
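The browser test can be approximated in a script that also separates the two causes. The Python sketch below is an assumption-laden illustration rather than a Wolfram utility: the function name diagnose_manager_url and its return strings are invented here. It first checks whether the host name in the URL resolves, then tries a direct HTTP request, deliberately ignoring any proxy settings.

```python
import http.client
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse


def diagnose_manager_url(url: str, timeout: float = 5.0) -> str:
    """Classify why a manager URL may be unreachable.

    Returns "invalid-url", "name-not-resolved", "no-response",
    or "reachable".
    """
    host = urlparse(url).hostname
    if host is None:
        return "invalid-url"
    try:
        socket.getaddrinfo(host, None)
    except socket.gaierror:
        # The host name is not known on this network.
        return "name-not-resolved"
    # Connect directly; an empty ProxyHandler disables proxy settings.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return "reachable"
    except urllib.error.HTTPError:
        # An HTTP error status still proves that something is listening.
        return "reachable"
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return "no-response"


# Example usage with the URL from the error message:
#     diagnose_manager_url(
#         "http://wolframServer.example.com:3737/WolframRemoteServices/Manager")
```

A result of "name-not-resolved" points to the host-naming issue, while "no-response" suggests that the manager is not running or that port 3737 is blocked.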
Cannot Connect to the Link (LinkConnect::linkc)
When trying to connect to a gridMathematica kernel, the following warning appears:
LinkConnect::linkc: Unable to connect to LinkObject[29778@192.168.70.1,29780@192.168.70.1,20,8].
LinkObject::linkn: Argument LinkObject[29778@192.168.70.1,29780@192.168.70.1,20,8] in LinkRead[LinkObject[29778@192.168.70.1,29780@192.168.70.1,20,8]] has an invalid LinkObject number; the link may be closed.
This suggests that the server was found and responded with a link, but that the link could not be used. The clue here is that the host in the link name, 192.168.70.1, is a private network address. The link system called by the Lightweight Grid Manager picked this address, but it is not visible outside of the server machine. You can inspect the link further by launching a kernel manually from the web interface.
The link automatically chooses two ports (29778 and 29780 in this example), so verify that the server and client are on the same network. An enabled firewall can also prevent the link from opening these ports.
The solution is to change the configuration of your system, or to use the LinkHost kernel configuration parameter to set the IP address for the link.
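To spot this situation quickly, the port and address pairs can be pulled out of the warning text and checked against the private address ranges. The Python sketch below is illustrative only: the helper names link_endpoints and is_private_address are hypothetical, and the pattern assumes the port@host form seen in the LinkObject name above.

```python
import ipaddress
import re

# Matches "port@host" pairs such as "29778@192.168.70.1".
LINK_ENDPOINT = re.compile(r"(\d+)@([^,\]\s]+)")


def link_endpoints(message: str) -> list:
    """Return the unique (port, host) pairs found in a warning message,
    in order of first appearance."""
    seen = []
    for port, host in LINK_ENDPOINT.findall(message):
        pair = (int(port), host)
        if pair not in seen:
            seen.append(pair)
    return seen


def is_private_address(host: str) -> bool:
    """Return True if host is a literal IP address in a private range.

    Host names (anything that is not a literal IP address) return False.
    """
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False


message = ("LinkConnect::linkc: Unable to connect to "
           "LinkObject[29778@192.168.70.1,29780@192.168.70.1,20,8].")
for port, host in link_endpoints(message):
    print(port, host, "private" if is_private_address(host) else "public")
```

A private address here is not a problem in itself when client and server share that network; it matters when the client sits on a different network and cannot route to it.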