BADVALUES(1) User Contributed Perl Documentation BADVALUES(1)NAMEPDL::BadValues - Discussion of bad value support in PDL
DESCRIPTION
What are bad values and why should I bother with them?
Sometimes it's useful to be able to specify a certain value is 'bad' or
'missing'; for example CCDs used in astronomy produce 2D images which
are not perfect since certain areas contain invalid data due to
imperfections in the detector. Whilst PDL's powerful index routines
and all the complicated business with dataflow, slices, etc etc mean
that these regions can be ignored in processing, it's awkward to do. It
would be much easier to be able to say "$c = $a + $b" and leave all the
hassle to the computer.
If you're not interested in this, then you may (rightly) be concerned
with how this affects the speed of PDL, since the overhead of checking
for a bad value at each operation can be large. Because of this, the
code has been written to be as fast as possible - particularly when
operating on piddles which do not contain bad values. In fact, you
should notice essentially no speed difference when working with piddles
which do not contain bad values.
However, if you do not want bad values, then PDL's "WITH_BADVAL"
configuration option comes to the rescue; if set to 0 or undef, the
bad-value support is ignored. About the only time I think you'll need
to use this - I admit, I'm biased ;) - is if you have limited disk or
memory space, since the size of the code is increased (see below).
You may also ask 'well, my computer supports IEEE NaN, so I already
have this'. Well, yes and no - many routines, such as "y=sin(x)", will
propagate NaN's without the user having to code differently, but
routines such as "qsort", or finding the median of an array, need to be
re-coded to handle bad values. For floating-point datatypes, "NaN" and
"Inf" are used to flag bad values IF the option "BADVAL_USENAN" is set
to 1 in your config file. Otherwise special values are used (Default
bad values). I do not have any benchmarks to see which option is
faster.
There is an experimental feature "BADVAL_PER_PDL" which, if set, allows
you to have different bad values for separate piddles of the same type.
This currently does not work with the "BADVAL_USENAN" option; if both
are set then PDL will ignore the "BADVAL_USENAN" value.
Code increase due to bad values
The following comparison is out of date!
On an i386 machine running Linux and Perl 5.005_03, I measured the
following sizes (the Slatec code was compiled in, but none of the other
options: e.g., FFTW, GSL, and TriD were):
WITH_BADVAL = 0
Size of blib directory after a successful make = 4963 kb: blib/arch
= 2485 kb and blib/lib = 1587 kb.
WITH_BADVAL = 1
Size of blib directory after a successful make = 5723 kb: blib/arch
= 3178 kb and blib/lib = 1613 kb.
So, the overall increase is only 15% - not much to pay for all the
wonders that bad values provides ;)
The source code used for this test had the vast majority of the core
routines (eg those in Basic/) converted to use bad values, whilst very
few of the 'external' routines (i.e. everything else in the PDL
distribution) had been changed.
A quick overview
pdl> p $PDL::Bad::Status
1
pdl> $a = sequence(4,3);
pdl> p $a
[
[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
]
pdl> $a = $a->setbadif( $a % 3 == 2 )
pdl> p $a
[
[ 0 1 BAD 3]
[ 4 BAD 6 7]
[BAD 9 10 BAD]
]
pdl> $a *= 3
pdl> p $a
[
[ 0 3 BAD 9]
[ 12 BAD 18 21]
[BAD 27 30 BAD]
]
pdl> p $a->sum
120
"demo bad" and "demo bad2" within perldl or pdl2 gives a demonstration
of some of the things possible with bad values. These are also
available on PDL's web-site, at http://pdl.perl.org/demos/. See
PDL::Bad for useful routines for working with bad values and t/bad.t to
see them in action.
The intention is to:
· not significantly affect PDL for users who don't need bad value
support
· be as fast as possible when bad value support is installed
If you never want bad value support, then you set "WITH_BADVAL" to 0 in
perldl.conf; PDL then has no bad value support compiled in, so will be
as fast as it used to be.
However, in most cases, the bad value support has a negligible affect
on speed, so you should set "WITH_CONFIG" to 1! One exception is if you
are low on memory, since the amount of code produced is larger (but
only by about 15% - see "Code increase due to bad values").
To find out if PDL has been compiled with bad value support, look at
the values of either $PDL::Config{WITH_BADVAL} or $PDL::Bad::Status -
if true then it has been.
To find out if a routine supports bad values, use the "badinfo" command
in perldl or pdl2 or the "-b" option to pdldoc. This facility is
currently a 'proof of concept' (or, more realistically, a quick hack)
so expect it to be rough around the edges.
Each piddle contains a flag - accessible via "$pdl->badflag" - to say
whether there's any bad data present:
· If false/0, which means there's no bad data here, the code supplied
by the "Code" option to "pp_def()" is executed. This means that the
speed should be very close to that obtained with "WITH_BADVAL=0",
since the only overhead is several accesses to a bit in the piddles
state variable.
· If true/1, then this says there MAY be bad data in the piddle, so
use the code in the "BadCode" option (assuming that the "pp_def()"
for this routine has been updated to have a BadCode key). You get
all the advantages of threading, as with the "Code" option, but it
will run slower since you are going to have to handle the presence
of bad values.
If you create a piddle, it will have its bad-value flag set to 0. To
change this, use "$pdl->badflag($new_bad_status)", where
$new_bad_status can be 0 or 1. When a routine creates a piddle, it's
bad-value flag will depend on the input piddles: unless over-ridden
(see the "CopyBadStatusCode" option to "pp_def"), the bad-value flag
will be set true if any of the input piddles contain bad values. To
check that a piddle really contains bad data, use the "check_badflag"
method.
NOTE: propogation of the badflag
If you change the badflag of a piddle, this change is propagated to all
the children of a piddle, so
pdl> $a = zeroes(20,30);
pdl> $b = $a->slice('0:10,0:10');
pdl> $c = $b->slice(',(2)');
pdl> print ">>c: ", $c->badflag, "\n";
>>c: 0
pdl> $a->badflag(1);
pdl> print ">>c: ", $c->badflag, "\n";
>>c: 1
No change is made to the parents of a piddle, so
pdl> print ">>a: ", $a->badflag, "\n";
>>a: 1
pdl> $c->badflag(0);
pdl> print ">>a: ", $a->badflag, "\n";
>>a: 1
Thoughts:
· the badflag can ONLY be cleared IF a piddle has NO parents, and
that this change will propagate to all the children of that piddle.
I am not so keen on this anymore (too awkward to code, for one).
· "$a->badflag(1)" should propagate the badflag to BOTH parents and
children.
This shouldn't be hard to implement (although an initial attempt
failed!). Does it make sense though? There's also the issue of what
happens if you change the badvalue of a piddle - should these propagate
to children/parents (yes) or whether you should only be able to change
the badvalue at the 'top' level - i.e. those piddles which do not have
parents.
The "orig_badvalue()" method returns the compile-time value for a given
datatype. It works on piddles, PDL::Type objects, and numbers - eg
$pdl->orig_badvalue(), byte->orig_badvalue(), and orig_badvalue(4).
It also has a horrible name...
To get the current bad value, use the "badvalue()" method - it has the
same syntax as "orig_badvalue()".
To change the current bad value, supply the new number to badvalue - eg
$pdl->badvalue(2.3), byte->badvalue(2), badvalue(5,-3e34).
Note: the value is silently converted to the correct C type, and
returned - i.e. "byte->badvalue(-26)" returns 230 on my Linux machine.
It is also a "nop" for floating-point types when "BADVAL_USENAN" is
true.
Note that changes to the bad value are NOT propagated to previously-
created piddles - they will still have the bad value set, but suddenly
the elements that were bad will become 'good', but containing the old
bad value. See discussion below. It's not a problem for floating-
point types which use NaN, since you can not change their badvalue.
Bad values and boolean operators
For those boolean operators in PDL::Ops, evaluation on a bad value
returns the bad value. Whilst this means that
$mask = $img > $thresh;
correctly propagates bad values, it will cause problems for checks such
as
do_something() if any( $img > $thresh );
which need to be re-written as something like
do_something() if any( setbadtoval( ($img > $thresh), 0 ) );
When using one of the 'projection' functions in PDL::Ufunc - such as
orover - bad values are skipped over (see the documentation of these
functions for the current (poor) handling of the case when all elements
are bad).
A bad value for each piddle, and related issues
An experimental option "BADVAL_PER_PDL" has been added to perldl.conf
to allow per-piddle bad values. The documentation has not been updated
to account for this change.
The following is relevant only for integer types, and for floating-
point types if "BADVAL_USENAN" was not set when PDL was built.
Currently, there is one bad value for each datatype. The code is
written so that we could have a separate bad value for each piddle
(stored in the pdl structure) - this would then remove the current
problem of:
pdl> $a = byte( 1, 2, byte->badvalue, 4, 5 );
pdl> p $a;
[1 2 255 4 5]
pdl> $a->badflag(1)
pdl> p $a;
[1 2 BAD 4 5]
pdl> byte->badvalue(0);
pdl> p $a;
[1 2 255 4 5]
ie the bad value in $a has lost its bad status using the current
implementation. It would almost certainly cause problems elsewhere
though!
IMPLEMENTATION DETAILS
During a "perl Makefile.PL", the file Basic/Core/badsupport.p is
created; this file contains the values of the "WITH_BADVAL",
"BADVAL_USENAN" and "BADVAL_PER_PDL" variables, and should be used by
code that is executed before the PDL::Config file is created (e.g.
Basic/Core/pdlcore.c.PL. However, most PDL code will just need to
access the %PDL::Config array (e.g. Basic/Bad/bad.pd) to find out
whether bad-value support is required.
A new flag has been added to the state of a piddle - "PDL_BADVAL". If
unset, then the piddle does not contain bad values, and so all the
support code can be ignored. If set, it does not guarantee that bad
values are present, just that they should be checked for. Thanks to
Christian, "badflag()" - which sets/clears this flag (see
Basic/Bad/bad.pd) - will update ALL the children/grandchildren/etc of a
piddle if its state changes (see "badflag" in Basic/Bad/bad.pd and
"propagate_badflag" in Basic/Core/Core.xs.PL). It's not clear what to
do with parents: I can see the reason for propagating a 'set badflag'
request to parents, but I think a child should NOT be able to clear the
badflag of a parent. There's also the issue of what happens when you
change the bad value for a piddle.
The "pdl_trans" structure has been extended to include an integer
value, "bvalflag", which acts as a switch to tell the code whether to
handle bad values or not. This value is set if any of the input piddles
have their "PDL_BADVAL" flag set (although this code can be replaced by
setting "FindBadStateCode" in pp_def). The logic of the check is going
to get a tad more complicated if I allow routines to fall back to using
the "Code" section for floating-point types (i.e. those routines with
"NoBadifNaN => 1" when "BADVAL_USENAN" is true).
The bad values for the integer types are now stored in a structure
within the Core PDL structure - "PDL.bvals" (eg
Basic/Core/pdlcore.h.PL); see also "typedef badvals" in
Basic/Core/pdl.h.PL and the BOOT code of Basic/Core/Core.xs.PL where
the values are initialised to (hopefully) sensible values. See
PDL/Bad/bad.pd for read/write routines to the values.
The addition of the "BADVAL_PER_PDL" option has resulted in additional
changes to the internals of piddles. These changes are not documented
yet.
Why not make a PDL subclass?
The support for bad values could have been done as a PDL sub-class.
The advantage of this approach would be that you only load in the code
to handle bad values if you actually want to use them. The downside is
that the code then gets separated: any bug fixes/improvements have to
be done to the code in two different files. With the present approach
the code is in the same "pp_def" function (although there is still the
problem that both "Code" and "BadCode" sections need updating).
Default bad values
The default/original bad values are set to (taken from the Starlink
distribution):
#include <limits.h>
PDL_Byte == UCHAR_MAX
PDL_Short == SHRT_MIN
PDL_Ushort == USHRT_MAX
PDL_Long == INT_MIN
If "BADVAL_USENAN == 0", then we also have
PDL_Float == -FLT_MAX
PDL_Double == -DBL_MAX
otherwise all of "NaN", "+Inf", and "-Inf" are taken to be bad for
floating-point types. In this case, the bad value can't be changed,
unlike the integer types.
How do I change a routine to handle bad values?
Examples can be found in most of the *.pd files in Basic/ (and
hopefully many more places soon!). Some of the logic might appear a
bit unclear - that's probably because it is! Comments appreciated.
All routines should automatically propagate the bad status flag to
output piddles, unless you declare otherwise.
If a routine explicitly deals with bad values, you must provide this
option to pp_def:
HandleBad => 1
This ensures that the correct variables are initialised for the $ISBAD
etc macros. It is also used by the automatic document-creation routines
to provide default information on the bad value support of a routine
without the user having to type it themselves (this is in its early
stages).
To flag a routine as NOT handling bad values, use
HandleBad => 0
This should cause the routine to print a warning if it's sent any
piddles with the bad flag set. Primitive's "intover" has had this set -
since it would be awkward to convert - but I've not tried it out to see
if it works.
If you want to handle bad values but not set the state of all the
output piddles, or if it's only one input piddle that's important, then
look at the PP rules "NewXSFindBadStatus" and "NewXSCopyBadStatus" and
the corresponding "pp_def" options:
FindBadStatusCode
By default, "FindBadStatusCode" creates code which sets
"$PRIV(bvalflag)" depending on the state of the bad flag of the
input piddles: see "findbadstatus" in Basic/Gen/PP.pm. User-
defined code should also store the value of "bvalflag" in the
"$BADFLAGCACHE()" variable.
CopyBadStatusCode
The default code here is a bit simpler than for
"FindBadStatusCode": the bad flag of the output piddles are set if
"$BADFLAGCACHE()" is true after the code has been evaluated.
Sometimes "CopyBadStatusCode" is set to an empty string, with the
responsibility of setting the badflag of the output piddle left to
the "BadCode" section (e.g. the "xxxover" routines in
Basic/Primitive/primitive.pd).
Prior to PDL 2.4.3 we used "$PRIV(bvalflag)" instead of
"$BADFLAGCACHE()". This is dangerous since the "$PRIV()" structure
is not guaranteed to be valid at this point in the code.
If you have a routine that you want to be able to use as in-place, look
at the routines in bad.pd (or ops.pd) which use the "in-place" option
to see how the bad flag is propagated to children using the
"xxxBadStatusCode" options. I decided not to automate this as rules
would be a little complex, since not every in-place op will need to
propagate the badflag (eg unary functions).
If the option
HandleBad => 1
is given, then many things happen. For integer types, the readdata
code automatically creates a variable called "<pdl name>_badval", which
contains the bad value for that piddle (see "get_xsdatapdecl()" in
Basic/Gen/PP/PdlParObjs.pm). However, do not hard code this name into
your code! Instead use macros (thanks to Tuomas for the suggestion):
'$ISBAD(a(n=>1))' expands to '$a(n=>1) == a_badval'
'$ISGOOD(a())' '$a() != a_badval'
'$SETBAD(bob())' '$bob() = bob_badval'
well, the "$a(...)" is expanded as well. Also, you can use a "$" before
the pdl name, if you so wish, but it begins to look like line noise -
eg "$ISGOOD($a())".
If you cache a piddle value in a variable -- eg "index" in slices.pd --
the following routines are useful:
'$ISBADVAR(c_var,pdl)' 'c_var == pdl_badval'
'$ISGOODVAR(c_var,pdl)' 'c_var != pdl_badval'
'$SETBADVAR(c_var,pdl)' 'c_var = pdl_badval'
The following have been introduced, They may need playing around with
to improve their use.
'$PPISBAD(CHILD,[i]) 'CHILD_physdatap[i] == CHILD_badval'
'$PPISGOOD(CHILD,[i]) 'CHILD_physdatap[i] != CHILD_badval'
'$PPSETBAD(CHILD,[i]) 'CHILD_physdatap[i] = CHILD_badval'
If "BADVAL_USENAN" is set, then it's a bit different for "float" and
"double", where we consider "NaN", "+Inf", and "-Inf" all to be bad. In
this case:
ISBAD becomes finite(piddle) == 0
ISGOOD finite(piddle) != 0
SETBAD piddle = NaN
where the value for NaN is discussed below in Handling NaN values.
This all means that you can change
Code => '$a() = $b() + $c();'
to
BadCode => 'if ( $ISBAD(b()) || $ISBAD(c()) ) {
$SETBAD(a());
} else {
$a() = $b() + $c();
}'
leaving Code as it is. PP::PDLCode will then create a loop something
like
if ( __trans->bvalflag ) {
threadloop over BadCode
} else {
threadloop over Code
}
(it's probably easier to just look at the .xs file to see what goes
on).
Going beyond the Code section
Similar to "BadCode", there's "BadBackCode", and "BadRedoDimsCode".
Handling "EquivCPOffsCode" is a bit different: under the assumption
that the only access to data is via the "$EQUIVCPOFFS(i,j)" macro, then
we can automatically create the 'bad' version of it; see the
"[EquivCPOffsCode]" and "[Code]" rules in PDL::PP.
Macro access to the bad flag of a piddle
Macros have been provided to provide access to the bad-flag status of a
pdl:
'$PDLSTATEISBAD(a)' -> '($PDL(a)->state & PDL_BADVAL) > 0'
'$PDLSTATEISGOOD(a)' '($PDL(a)->state & PDL_BADVAL) == 0'
'$PDLSTATESETBAD(a)' '$PDL(a)->state |= PDL_BADVAL'
'$PDLSTATESETGOOD(a)' '$PDL(a)->state &= ~PDL_BADVAL'
For use in "xxxxBadStatusCode" (+ other stuff that goes into the INIT:
section) there are:
'$SETPDLSTATEBAD(a)' -> 'a->state |= PDL_BADVAL'
'$SETPDLSTATEGOOD(a)' -> 'a->state &= ~PDL_BADVAL'
'$ISPDLSTATEBAD(a)' -> '((a->state & PDL_BADVAL) > 0)'
'$ISPDLSTATEGOOD(a)' -> '((a->state & PDL_BADVAL) == 0)'
In PDL 2.4.3 the "$BADFLAGCACHE()" macro was introduced for use in
"FindBadStatusCode" and "CopyBadStatusCode".
Handling NaN values
There are two issues:
NaN as the bad value
which is done. To select, set "BADVAL_USENAN" to 1 in perldl.conf;
a value of 0 falls back to treating the floating-point types the
same as the integers. I need to do some benchmarks to see which is
faster, and whether it's dependent on machines (Linux seems to slow
down much more than my Sparc machine in some very simple tests I
did).
Ignoring BadCode sections
which is not.
For simple routines processing floating-point numbers, we should let
the computer process the bad values (i.e. "NaN" and "Inf" values)
instead of using the code in the "BadCode" section. Many such routines
have been labeled using "NoBadifNaN => 1"; however this is currently
ignored by PDL::PP.
For these routines, we want to use the "Code" section if
the piddle does not have its bad flag set
the datatype is a float or double
otherwise we use the "BadCode" section. This is NOT IMPLEMENTED, as it
will require reasonable hacking of PP::PDLCode!
There's also the problem of how we handle 'exceptions' - since "$a =
pdl(2) / pdl(0)" produces a bad value but doesn't update the badflag
value of the piddle. Can we catch an exception, or do we have to trap
for this (e.g. search for "exception" in Basic/Ops/ops.pd)?
Checking for "Nan", and "Inf" is done by using the "finite()" system
call. If you want to set a value to the "NaN" value, the following bit
of code can be used (this can be found in both Basic/Core/Core.xs.PL
and Basic/Bad/bad.pd):
/* for big-endian machines */
static union { unsigned char __c[4]; float __d; }
__pdl_nan = { { 0x7f, 0xc0, 0, 0 } };
/* for little-endian machines */
static union { unsigned char __c[4]; float __d; }
__pdl_nan = { { 0, 0, 0xc0, 0x7f } };
This approach should probably be replaced by library routines such as
"nan("")" or "atof("NaN")".
To find out whether a particular machine is big endian, use the routine
"PDL::Core::Dev::isbigendian()".
WHAT ABOUT DOCUMENTATION?
One of the strengths of PDL is it's on-line documentation. The aim is
to use this system to provide information on how/if a routine supports
bad values: in many cases "pp_def()" contains all the information
anyway, so the function-writer doesn't need to do anything at all! For
the cases when this is not sufficient, there's the "BadDoc" option. For
code written at the Perl level - i.e. in a .pm file - use the "=for
bad" pod directive.
This information will be available via man/pod2man/html documentation.
It's also accessible from the "perldl" or "pdl2" shells - using the
"badinfo" command - and the "pdldoc" shell command - using the "-b"
option.
This support is at a very early stage - i.e. not much thought has gone
into it: comments are welcome; improvements to the code preferred ;)
One awkward problem is for *.pm code: you have to write a *.pm.PL file
which only inserts the "=for bad" directive (+ text) if bad value
support is compiled in. In fact, this is a pain when handling bad
values at the Perl, rather than PDL::PP, level: perhaps I should just
scrap the "WITH_BADVAL" option...
CURRENT ISSUES
There are a number of areas that need work, user input, or both! They
are mentioned elsewhere in this document, but this is just to make sure
they don't get lost.
Trapping invalid mathematical operations
Should we add exceptions to the functions in "PDL::Ops" to set the
output bad for out-of-range input values?
pdl> p log10(pdl(10,100,-1))
I would like the above to produce "[1 2 BAD]", but this would slow down
operations on all piddles. We could check for "NaN"/"Inf" values after
the operation, but I doubt that would be any faster.
Integration with NaN
When "BADVAL_USENAN" is true, the routines in "PDL::Ops" should just
fall through to the "Code" section - i.e. don't use "BadCode" - for
"float" and "double" data types.
Global versus per-piddle bad values
I think all that's needed is to change the routines in
"Basic/Core/pdlconv.c.PL", although there's bound to be complications.
It would also mean that the pdl structure would need to have a variable
to store its bad value, which would mean binary incompatibility with
previous versions of PDL with bad value support.
As of 17 March 2006, PDL contains the experimental "BADVAL_PER_PDL"
configuration option which, if selected, adds per-piddle bad values.
Dataflow of the badflag
Currently changes to the bad flag are propagated to the children of a
piddle, but perhaps they should also be passed on to the parents as
well. With the advent of per-piddle bad values we need to consider how
to handle changes to the value used to represent bad items too.
EVERYTHING ELSE
The build process has been affected. The following files are now
created during the build:
Basic/Core/pdlcore.h pdlcore.h.PL
pdlcore.c pdlcore.c.PL
pdlapi.c pdlapi.c.PL
Core.xs Core.xs.PL
Core.pm Core.pm.PL
Several new files have been added:
Basic/Pod/BadValues.pod (i.e. this file)
t/bad.t
Basic/Bad/
Basic/Bad/Makefile.PL
bad.pd
IO/NDF/NDF.xs.PL
etc
TODO/SUGGESTIONS
· Look at using per-piddle bad values. Would mean a change to the
pdl structure (i.e. binary incompatibility) and the routines in
"Basic/Core/pdlconv.c.PL" would need changing to handle this. Most
other routines should not need to be changed ...
See the experimental "BADVAL_PER_PDL" option.
· what to do about "$b = pdl(-2); $a = log10($b)" - $a should be set
bad, but it currently isn't.
· Allow the operations in PDL::Ops to skip the check for bad values
when using NaN as a bad value and processing a floating-point
piddle. Needs a fair bit of work to PDL::PP::PDLCode.
· "$pdl->baddata()" now updates all the children of this piddle as
well. However, not sure what to do with parents, since:
$b = $a->slice();
$b->baddata(0)
doesn't mean that $a shouldn't have it's badvalue cleared.
however, after
$b->baddata(1)
it's sensible to assume that the parents now get flagged as
containing bad values.
PERHAPS you can only clear the bad value flag if you are NOT a
child of another piddle, whereas if you set the flag then all
children AND parents should be set as well?
Similarly, if you change the bad value in a piddle, should this be
propagated to parent & children? Or should you only be able to do
this on the 'top-level' piddle? Nasty...
· get some code set up to do benchmarks to see how much things are
slowed down (and to check that I haven't messed things up if
"WITH_BADVAL" is 0/undef).
· some of the names aren't appealing - I'm thinking of
"orig_badvalue()" in Basic/Bad/bad.pd in particular. Any
suggestions appreciated.
AUTHOR
Copyright (C) Doug Burke (djburke@cpan.org), 2000, 2006.
The per-piddle bad value support is by Heiko Klein (2006).
Commercial reproduction of this documentation in a different format is
forbidden.
perl v5.14.1 2011-03-30 BADVALUES(1)