Bug 106008

Summary:	NaN values should not be skipped while importing data from ASCII file
Product:	[Applications] kst	Reporter:	Nicolas Brisset <nicolas.brisset>
Component:	general	Assignee:	kst
Status:	RESOLVED FIXED
Severity:	normal
Priority:	NOR
Version First Reported In:	1.x
Target Milestone:	---
Platform:	Compiled Sources
OS:	Linux
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:
Attachments:	Nice testcase (world map) Nice testcase (world map)

Description Nicolas Brisset 2005-05-20 09:54:41 UTC

Version:           1.1.0_beta2 (using KDE KDE 3.4.0)
Installed from:    Compiled From Sources
Compiler:          gcc 3.3.3 
OS:                Linux

When loading vectors from an ASCII files, NaN (Not a Number) values are skipped, which can result in pretty bad and hard to understand problems. For instance, imagine I have a file with X and Y vectors, with some data missing and replaced by some non-numeric text string in the data file (it sometimes happens in the real world!). If 10 values are missing for X and 20 for Y, when I plot Y vs X I may get erratic results because of the skipped values and automatic interpolation performed by kst.
I think NaN values should be preserved in the vectors, and resulting points skipped when plotting. If the plot uses lines, when there are such points they should be "cut".

Comment 1 Andrew Walker 2005-05-21 02:27:16 UTC

Just to clarify:

NaN points within vectors will not be plotted and will be skipped when plotting. The only problem is with the ascii datasource.

Comment 2 Andrew Walker 2005-06-16 21:10:40 UTC

Could you please attach a date file that shows the problem and details of what you're plotting against what.

Comment 3 Nicolas Brisset 2005-06-17 09:39:13 UTC

Created attachment 11485 [details]
Nice testcase (world map)

When the attached file is loaded (world.txt), NaN values are "skipped" at some
point (I don't think it happens at the datasource level, but rather in one of
the calling functions).
The expected result is a nice world map, without connection lines everywhere
:-)

Comment 4 Nicolas Brisset 2005-06-17 09:41:12 UTC

Created attachment 11486 [details]
Nice testcase (world map)

When the attached file is loaded (world.txt), NaN values are "skipped" at some
point (I don't think it happens at the datasource level, but rather in one of
the calling functions).
The expected result is a nice world map, without connection lines everywhere
:-)

Comment 5 Andrew Walker 2005-06-17 21:58:34 UTC

This problem does arise at the data source level. Any line that does not contain at least one digit is simply discarded. Thus, the data source will read in both of the following:

100.0 NaN
NaN 100.0

but will not read in:

NaN NaN

Comment 6 Andrew Walker 2005-06-17 22:03:28 UTC

This behaviour is defined in AsciiSource::update(int u). The question remaining is do we want to fix it? I would suggest that we do as if the data stream represents points sampled every second then a line comprised of all NaN values that is simply skipped will change the meaning of that data.

Comment 7 Andrew Walker 2005-06-18 00:33:55 UTC

I've made changes to both kst2dplot.cpp and ascii.cpp to fix this problem. The changes to ascii.cpp should be reviewed as things may run slower than before. If I don't get any feedback in the next couple of days I'll mark this bug report as fixed.

Changes to kst2dplot.cpp were necessary to correct problems with drawing when the x-value does not change across a NaN value.

Comment 8 Andrew Walker 2005-06-20 19:11:46 UTC

No comments, so all is well.

Comment 9 Netterfield 2005-06-21 15:31:30 UTC

Do you have a feeling as to how much slower things are?
Do not close the bug until this has been answered.

On Friday 17 June 2005 18:33, Andrew Walker wrote:
> problem. The changes to ascii.cpp should be reviewed as things may run
> slower than before. If I don't get any feedback in the next couple of days
> I'll mark this bug report as fixed.

Comment 10 Andrew Walker 2005-06-21 18:50:46 UTC

I can certainly do some timing on it to get
an estimate, but ultimately there may not be
much we can do if we want the data to be read
in correctly.

Andrew

-----Original Message-----
From: netterfield@astro.utoronto.ca
[mailto:netterfield@astro.utoronto.ca]
Sent: Tuesday, June 21, 2005 6:32 AM
To: kst@kde.org
Subject: [Kst] [Bug 106008] NaN values should not be skipped while
importingdata from ASCII file 


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
         
http://bugs.kde.org/show_bug.cgi?id=106008         




------- Additional Comments From netterfield astro utoronto ca  2005-06-21 15:31 -------
Do you have a feeling as to how much slower things are?
Do not close the bug until this has been answered.

On Friday 17 June 2005 18:33, Andrew Walker wrote:
> problem. The changes to ascii.cpp should be reviewed as things may run
> slower than before. If I don't get any feedback in the next couple of days
> I'll mark this bug report as fixed.

_______________________________________________
Kst mailing list
Kst@kde.org
https://mail.kde.org/mailman/listinfo/kst

Comment 11 Andrew Walker 2005-06-21 23:01:55 UTC

SVN commit 427790 by arwalker:

CCMAIL:106008@bugs.kde.org ascii datasource now runs faster than the original version, by about 40 percent. The only difference between the two is that a line such as 'X Y Z' would previously have been skipped, but is now treated (correctly I believe) as '0 0 0'

 M  +12 -5     ascii.cpp  


--- trunk/extragear/graphics/kst/kst/datasources/ascii/ascii.cpp #427789:427790
@@ -280,10 +280,12 @@
   }
 
   _valid = true;
-
+  
   int bufstart, bufread;
   bool new_data = false;
-  char tmpbuf[MAXBUFREADLEN];
+  char tmpbuf[MAXBUFREADLEN+1];
+  char* comment;
+  
   do {
     /* Read the tmpbuffer, starting at row_index[_numFrames] */
     if (_byteLength - _rowIndex[_numFrames] > MAXBUFREADLEN) {
@@ -295,10 +297,12 @@
     bufstart = _rowIndex[_numFrames];
     file.at(bufstart); // expensive?
     file.readBlock(tmpbuf, bufread);
-
+    tmpbuf[bufread] = '\0';
+    
     bool is_comment = false, has_dat = false;
+    comment = strpbrk(tmpbuf, _config->_delimiters.latin1());
     for (int i = 0; i < bufread; i++) {      
-      if (_config->_delimiters.contains(tmpbuf[i])) {    
+      if (comment == &(tmpbuf[i])) {    
         is_comment = true;
       } else if (tmpbuf[i] == '\n' || tmpbuf[i] == '\r') {
         if (has_dat) {
@@ -311,12 +315,15 @@
         }
         _rowIndex[_numFrames] = bufstart + i + 1;
         has_dat = is_comment = false;
+        if (comment && comment < &(tmpbuf[i])) {
+          comment = strpbrk(&(tmpbuf[i]), _config->_delimiters.latin1());
+        }
       } else if (!is_comment && !isspace(tmpbuf[i])) {   
         has_dat = true;
       }
     }
   } while (bufread == MAXBUFREADLEN);
-
+  
   file.close();
 
   updateNumFramesScalar();

Comment 12 Nicolas Brisset 2005-06-22 08:59:56 UTC

I believe it would be better for a line like 'X Y Z' to generate 'NaN' values in the vectors, so that curves using these vectors are not drawn at these places. If non-numeric values are turned into 0, how can you know that these values are missing ? It could very well happen that a 0 value is physically acceptable for a vector (i.e. you can't tell there was a missing value just because it's 0), while very different from NaN (no data available, e.g. broken sensor or whatever).

Comment 13 Netterfield 2005-06-22 14:44:38 UTC

On June 22, 2005 02:59 am, Nicolas Brisset wrote:
> 08:59 ------- I believe it would be better for a line like 'X Y Z' to
> generate 'NaN' values in the vectors, so that curves using these vectors
> are not drawn at these places. 

Only if this can be done with no speed impact in the data source.  Otherwise, 
the current behaviour is acceptable.

Comment 14 Andrew Walker 2005-06-23 01:00:30 UTC

SVN commit 428063 by arwalker:

CCMAIL:106008@bugs.kde.org Interpret text strings as NaN instead of 0.0 with no appreciable increase in read time of data

 M  +6 -0      ascii.cpp  


--- trunk/extragear/graphics/kst/kst/datasources/ascii/ascii.cpp #428062:428063
@@ -414,6 +414,9 @@
             i_col++;
             if (i_col == col) {
               v[i] = atof(_tmpBuf + ch);
+              if (v[i] == 0.0 && !isdigit(_tmpBuf[ch])) {
+                v[i] = atof("NaN");
+              }
               break;
             }
           }
@@ -441,6 +444,9 @@
             i_col++;
             if (i_col == col) {
               v[i] = atof(_tmpBuf + ch);
+              if (v[i] == 0.0 && !isdigit(_tmpBuf[ch])) {
+                v[i] = atof("NaN");
+              }
               break;
             }
           }