Latest revision as of 03:24, 25 April 2010

One of the most common questions we hear on the ns-3 developers list is a variation on the following theme: I wrote my program, but when I run it I get a red line that ends with "terminated with signal SIGSEGV". Please tell me what I did wrong.

The complete output from waf will look something like,

 ./waf --run hs
 Waf: Entering directory `/home/craigdo/repos/ns-3-allinone-dev/ns-3-dev/build'
 Waf: Leaving directory `/home/craigdo/repos/ns-3-allinone-dev/ns-3-dev/build'
 'build' finished successfully (0.881s)
 Command ['/home/craigdo/repos/ns-3-allinone-dev/ns-3-dev/build/debug/scratch/hs'] 
 terminated with signal SIGSEGV. Run it under a debugger to get more information 
 (./waf --run <program> --command-template="gdb --args %s <args>").

In this HOWTO, we describe what this means and how you can go about finding your problem.

HOWTO understand and find cause of terminated with signal errors

The zeroth thing to understand about debugging is that one of the least productive things you can do is post a pile of your code on a developers list and ask why it doesn't work. Developers are very busy people who won't have a lot of spare time to do your work for you. Try and figure it out on your own. You are doing the right thing by reading this page!

The first thing to understand is that debugging anything is an art and skill that you need to learn. Some jokers have observed that programming can be defined as the act of introducing bugs. This is not too far from the truth (which is why it is funny). Since you will be programming in the ns-3 environment, you are going to have to develop debugging skills, whether you like it or not, in order to remove the bugs you create. This HOWTO is only going to scratch the surface of the subject of debugging and hopefully provide you with a direction and a few hints regarding how to start. You are going to have to figure most of this out on your own, though. Don't worry, it gets easier.

It will be much easier on you if you learn from the experience of others. There are many books available that will help you learn the details of this huge subject. If you go to Amazon.com and search for "debugging" in their books section, you will find over 2,000 results. A couple of books that have been recommended on ns-developers are

  • Agans, "Debugging"
  • Matloff and Salzman, "The Art of Debugging with GDB, DDD, and Eclipse"

Reproduce the Problem

If you have read a good book on debugging, you will know that the first step in finding any problem is to figure out how to reproduce it. In this case, we need to produce it, so let's take the simplest ns-3 example and create a reproducible problem.

The hello-simulator.cc example, you may recall, just prints the text "Hello Simulator" on your console using the ns-3 logging system. It is simple enough that we can reproduce it here:

 #include "ns3/core-module.h"
 NS_LOG_COMPONENT_DEFINE ("HelloSimulator");
 using namespace ns3;
 
 int
 main (int argc, char *argv[])
 {
   NS_LOG_UNCOND ("Hello Simulator");
 }

Go ahead and copy the example into the scratch directory. The following assumes that you are in the base directory of an ns-3 distribution (the directory where RELEASE, VERSION and src are found).

 cp examples/tutorial/hello-simulator.cc scratch/hs.cc

Now pull up the file you just created (scratch/hs.cc) in your favorite programmer's editor and add a line so that the main function looks like this:

 int
 main (int argc, char *argv[])
 {
   NS_LOG_UNCOND ("Hello Simulator");
   return 1;
 }

Go ahead and build and run the new program:

 ./waf
 ./waf --run hs

You should see something that looks like:

 Waf: Entering directory `/your/directory/path/ns-3-allinone-dev/ns-3-dev/build'
 Waf: Leaving directory `/your/directory/path/ns-3-allinone-dev/ns-3-dev/build'
 'build' finished successfully (0.872s)
 Hello Simulator
 Command ['/your/directory/path/ns-3-allinone-dev/ns-3-dev/build/debug/scratch/hs'] exited with code 1

You should now have a reproducible bug, since if you repeat the waf run command, your program exits with code 1 every time.

What the Problem Means

The short answer is that the program did not return a zero as its exit or return code. Waf reports this back in red since it usually means that the program has failed in some way. This return code can either come from the return value from your main function, or it can be supplied by the operating system or run-time system if your program does not complete for some reason.

In general, strictly positive return codes indicate a program that completed "normally" (that is, the main function returned some value) but detected some error. In the code above, the hs program completed normally, but returned the value one. In real-world programs, this would indicate an error condition that you as a user could look up in the hs documentation and interpret. In this case, waf does not know what the return value of one means, so it simply reports that value.

Negative return codes typically indicate that the program has failed in some way such that it cannot complete. In Unix and Linux, these codes are usually the negative of a so-called SIGNAL. You can find a list of signals in /usr/include/asm/signal.h if you are interested. The first few are:

 #define SIGHUP   1
 #define SIGINT   2
 #define SIGQUIT  3
 #define SIGILL   4
 #define SIGTRAP  5
 #define SIGABRT  6
 #define SIGIOT   6
 #define SIGBUS   7
 #define SIGFPE   8
 #define SIGKILL  9
 #define SIGUSR1  10
 #define SIGSEGV  11
 #define SIGUSR2  12
 #define SIGPIPE  13
 #define SIGALRM  14
 #define SIGTERM  15

Since these negative error codes are well known, waf can translate a return value that is really a negative integer into a string value. A common case is when executing your program causes the operating system to intervene and return an exit code of -11. The root cause is something called a SIGSEGV signal since its defined value is 11, which is the negative of -11. Waf converts the -11 to the string "signal SIGSEGV" and prints this as its output.
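
To make the sign convention concrete, here is a minimal sketch (not waf's actual code) of how a parent process on a POSIX system can tell a normal exit from a death by signal. The helper names RunInChild and Describe and the two sample bodies are made up for this illustration. The negated signal number mirrors the convention used by Python's subprocess module, which is presumably where waf's negative values come from.

```cpp
#include <cassert>
#include <csignal>
#include <string>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run `body` in a child process and return the
// exit status if the child exited on its own, or the negated signal
// number if a signal killed it.
int
RunInChild (int (*body) ())
{
  assert (body != nullptr);
  pid_t pid = fork ();
  if (pid < 0)
    {
      return -1000; // fork failed; sentinel chosen for this sketch only
    }
  if (pid == 0)
    {
      _exit (body ()); // child: run the body, exit with its result
    }
  int status = 0;
  if (waitpid (pid, &status, 0) < 0)
    {
      return -1000; // could not collect the child's status
    }
  if (WIFSIGNALED (status))
    {
      return -(WTERMSIG (status)); // e.g. -11 for SIGSEGV on Linux
    }
  return WEXITSTATUS (status);
}

// Turn such a code into a message similar in spirit to waf's red line.
std::string
Describe (int code)
{
  if (code < 0)
    {
      return std::string ("terminated with signal ") + strsignal (-code);
    }
  return "exited with code " + std::to_string (code);
}

// Sample bodies: one returns 1 like the modified hs program, the
// other sends itself SIGSEGV with the default action restored.
int
ReturnsOne ()
{
  return 1;
}

int
Segfaults ()
{
  signal (SIGSEGV, SIG_DFL);
  raise (SIGSEGV);
  return 0;
}
```

With these definitions, Describe (RunInChild (ReturnsOne)) yields "exited with code 1", while RunInChild (Segfaults) returns -11 on Linux, where SIGSEGV is defined as 11.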

Google is your friend. If you search for sigsegv, you will find a nice Wikipedia entry: http://en.wikipedia.org/wiki/SIGSEGV which then points you to another Wikipedia page: http://en.wikipedia.org/wiki/Segmentation_fault

On that page, you will find a reasonable definition of a segmentation violation:

 A segmentation fault (often shortened to segfault) or access violation is a 
 particular error condition that can occur during the operation of computer 
 software. A segmentation fault occurs when a program attempts to access a 
 memory location that it is not allowed to access, or attempts to access a
 memory location in a way that is not allowed (for example, attempting to 
 write to a read-only location, or to overwrite part of the operating 
 system)

This is what is happening when you run your program and you see the dreaded red message from waf:

 Command ['/home/craigdo/repos/ns-3-allinone-dev/ns-3-dev/build/debug/scratch/hs'] 
 terminated with signal SIGSEGV. Run it under a debugger to get more information (./waf --run <program> 
 --command-template="gdb --args %s <args>").

Let's Reproduce One of Those

Pull up the file you created (scratch/hs.cc) in your favorite programmer's editor and change that line you added so that the main function looks like this:

 int
 main (int argc, char *argv[])
 {
   NS_LOG_UNCOND ("Hello Simulator");
   *(char *)0 = 0;
 }

If you build and run, you should see that waf highlights the fact that your program now crashes with a segmentation fault by displaying the infamous red line:

 Waf: Entering directory `/your/directory/path/ns-3-allinone-dev/ns-3-dev/build'
 Waf: Leaving directory `/your/directory/path/ns-3-allinone-dev/ns-3-dev/build'
 'build' finished successfully (0.872s)
 Hello Simulator
 Command ['/home/craigdo/repos/ns-3-allinone-dev/ns-3-dev/build/debug/scratch/hs'] 
 terminated with signal SIGSEGV. Run it under a debugger to get more information (./waf --run <program> 
 --command-template="gdb --args %s <args>").

What you have done by adding the line,

   *(char *)0 = 0;

is to try to write a zero byte to address zero of your system. In every system that I can think of, address zero is off limits to user programs. On some machines it sits in a reserved page holding important things like reset vectors which users must not be allowed to change; on a typical Linux system the page at address zero is simply not mapped into your program's address space, so that accesses through a null pointer are caught. Either way, this access is illegal; your operating system detects the attempt and summarily stops your program. This is called "a crash."

So, when your program exits with a SIGSEGV, it has done something that the operating system considers as bad with respect to accessing memory. The red line with the error code from waf is simply telling you what has happened. Your next job is to figure out what you did that the operating system doesn't like.
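
In real code the bad address is rarely a literal zero; it is usually a pointer that was never set or that a lookup returned as null. The following sketch shows two common defenses, assuming plain C++ pointers. The helper names StoreByte and LoadByte are hypothetical and are not part of ns-3 (ns-3 offers the NS_ASSERT macro for the fail-fast style).

```cpp
#include <cassert>

// Hypothetical helper: store `value` through `target` only when the
// pointer is non-null; report failure instead of crashing otherwise.
bool
StoreByte (char *target, char value)
{
  if (target == nullptr)
    {
      return false; // the caller decides how to handle the bad pointer
    }
  *target = value;
  return true;
}

// Hypothetical helper: fail fast instead. If `source` is null, the
// assertion aborts the program (SIGABRT) with a file and line number,
// which is easier to track down than a bare SIGSEGV.
char
LoadByte (const char *source)
{
  assert (source != nullptr);
  return *source;
}
```

Which style fits depends on whether a null pointer is an expected condition (check and report) or a programming error (assert).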

Finding and Fixing the Problem

Since there are literally an infinite number of ways you can introduce a segmentation violation into your code, there is no way I can tell you how to fix your code. What I can do is to explain how to run your program in a debugger so you can see the point at which the operating system decided your program has gone bad. There are many debuggers, and you will probably come to know and love gdb for its power and ubiquity. Let's start with something small, though. For beginners, a graphical debugger is probably the way to go, and insight is fairly intuitive to use. It turns out that insight is actually a graphical wrapper for gdb, so you can eventually get to the more powerful gdb features as you learn more; so this isn't a completely pointless exercise :-)

If your system does not come with insight, you can install the package simply by using either

 sudo yum install insight

or

 sudo apt-get install insight

There are two basic ways to run a program under a debugger in ns-3. You can run the program using a so-called command-template as is suggested by waf in its error message:

 ./waf --run hs --command-template="insight %s"

or you can enter a shell, change into the appropriate directory and run the debugger directly

 ./waf shell
 cd build/debug/scratch
 insight hs

If you choose insight, in either case, you will end up with a new window -- an insight source window. If you choose gdb (as waf suggests) you will end up with a command line debugger. I'll assume you went with insight from here on.

If you click the little "running man" icon on the toolbar right under the "File" menu item (see the image below), a breakpoint (google is still your friend) will automatically be set for you at the start of the main function and your program will be started and run. Execution of your program will be stopped at the first source line in main which is the NS_LOG_UNCOND that prints "Hello Simulator". The fact that the program has stopped is indicated to you by the green background at the source line. You should be seeing a window that looks like the following:

[Image: Insight.png]

To the right of the "running man" are some parenthesis icons that control execution of your program. Most of them have arrows that end up pointing down (stepping "down" into functions). One of them has a red arrow pointing to the right. This is the "continue" button. If you press this button, your program will "continue" running until it exits, hits another breakpoint, or does something evil.

Go ahead and press the button. You will see a warning popup window appear that tells you that insight has "received signal SIGSEGV, Segmentation fault". You expected something like that, correct? If you dismiss the popup, insight will show you at which source line the program stopped by coloring its background green. In this case, the offending line is,

 *(char *)0 = 0;

which caused a segmentation fault by attempting to write to an address outside the valid address space of your program.

More Errors

From the list of signals above, you may have already figured out that there are a number of errors that will cause the operating system to intervene and stop your program. Let's illustrate another common case. Pull up the file you created (scratch/hs.cc) in your favorite programmer's editor and change that line you added so that the main function looks like this:

 int
 main (int argc, char *argv[])
 {
   NS_LOG_UNCOND ("Hello Simulator");
   int i0 = 0, i1 = 1;
   NS_LOG_UNCOND (i1 / i0);
 }

If you build and run, you should now see that waf highlights the fact that your program crashes with a different fault:

 Waf: Entering directory `/your/directory/path/ns-3-allinone-dev/ns-3-dev/build'
 Waf: Leaving directory `/your/directory/path/ns-3-allinone-dev/ns-3-dev/build'
 'build' finished successfully (0.872s)
 Hello Simulator
 Command ['/home/craigdo/repos/ns-3-allinone-dev/ns-3-dev/build/debug/scratch/hs']
 terminated with signal SIGFPE. Run it under a debugger to get more information (./waf --run <program> 
 --command-template="gdb --args %s <args>").

What you have done by adding the lines,

   int i0 = 0, i1 = 1;
   NS_LOG_UNCOND (i1 / i0);

is to cause a division-by-zero error in your program. The name SIGFPE indicates a Floating Point Exception (FPE) even though there are no floating point numbers in sight. It turns out that SIGFPE is used for general arithmetic exceptions. Again, google is your friend if you want to figure out exactly what these signals mean. Wikipedia provides a page dedicated to SIGFPE (http://en.wikipedia.org/wiki/SIGFPE).

So, when your program exits with a SIGFPE, it has done something that the operating system considers as bad from an arithmetic point of view. The red line with the error code from waf is simply telling you what has happened. Your job will be to figure out what the system didn't like (from looking at the offending source line and the variables involved).
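
Once you have found the offending division, one way to avoid the signal is to test the operands first. The sketch below assumes plain int arithmetic; the helper name CheckedDivide is hypothetical and not part of ns-3. Besides division by zero, it also rejects the most negative int divided by -1, whose result does not fit in an int and which typically raises SIGFPE on x86 hardware as well. Floating point division by zero, by contrast, normally produces an infinity rather than a signal.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: divide only when the quotient is well defined.
// Returns true and stores the quotient in *result on success; returns
// false and leaves *result untouched otherwise.
bool
CheckedDivide (int numerator, int denominator, int *result)
{
  assert (result != nullptr);
  if (denominator == 0)
    {
      return false; // would be a division by zero
    }
  if (numerator == std::numeric_limits<int>::min () && denominator == -1)
    {
      return false; // quotient would overflow int
    }
  *result = numerator / denominator;
  return true;
}
```

A caller would check the return value and report the problem (for example through the logging system) instead of letting the operating system stop the program.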

What Next

Obviously, this HOWTO is not a place to provide a manual for the insight debugger, nor is it a place for general debugging references. You can attempt to push forward on your own by reading the insight documentation and trying to figure out debugging on your own. If you are new enough to this debugging thing to have learned anything in this HOWTO, I strongly recommend that you pick up one of the books on debugging techniques and start working through some of their examples. It will most likely save you a lot of stress.

Conclusion

As mentioned above, debugging is both an art and a skill and you can spend the rest of your life mastering it. Many of us learned the hard way by having many, many bugs master us. You can choose that way, the hard way, but we who have been down that road think it will pay off if you take a small break at this point and do some reading or ask a colleague with real experience in this area for some help and guidance.

Good luck and happy debugging!


Craigdo 02:34, 24 April 2010 (UTC)