Bug 474732

Summary: Number of generated values from generate data with stdev or similar column statistic functions seems to be wrong
Product: [Applications] LabPlot2 Reporter: pcfreak115
Component: general Assignee: Alexander Semke <alexander.semke>
Status: RESOLVED FIXED    
Severity: normal CC: martin.marmsoler
Priority: NOR    
Version First Reported In: 2.10.0   
Target Milestone: ---   
Platform: Manjaro   
OS: Linux   
Latest Commit: Version Fixed/Implemented In: 2.11
Sentry Crash Report:
Attachments: The case in which the table is unnecessarily expanded
The case in which the table has not enough error values.
new option to auto resize the target column

Description pcfreak115 2023-09-20 11:20:54 UTC
STEPS TO REPRODUCE
1. Create column with numbers in it
2. Create second column in another table
3. For this second column, use Generate Data and then the function stdev(x), where for x the column from step 1 is used

OBSERVED RESULT
The second column contains exactly as many entries as there are in the first column

EXPECTED RESULT
For statistics such as stdev, or any other function that returns a single number for a whole column, the number of entries in the second column should depend on the size of that second column/table, not on the number of entries the statistic is based on (here, the first column).

ADDITIONAL INFORMATION
As another example, consider having a data set with e.g. 100 entries. I would like to use their standard deviation as error in another data set with e.g. 200 entries. Using the generate functionality, however, only the first 100 rows for the error are filled with the standard deviation, although I would expect it to fill my whole column, in this case all 200 entries (or whatever the column length is set to).

Should the other data set only contain, for example, 50 entries, my table gets unnecessarily expanded to fit 100 values for the error, even though I just need 50. Manually reducing the table size works only until the first data set (the one I take the standard deviation of) is changed in some way; then the table with my 50 entries is expanded once again (provided the auto update checkbox in the generate data dialog is checked).

In this second case the bug is just a bit annoying, but in the first case I would be missing error values, which is problematic.
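
The expected behavior can be sketched in a few lines of Python. This is only an illustration, not LabPlot code: the function name fill_with_statistic is hypothetical, and Python's statistics.stdev (the sample standard deviation) is assumed to stand in for LabPlot's stdev.

```python
from statistics import stdev


def fill_with_statistic(source, target_rows):
    """Return target_rows copies of the standard deviation of source.

    The length of the result is set by the target, not by the source.
    """
    value = stdev(source)  # one number for the whole source column
    return [value] * target_rows


noise = [1.0, 2.0, 3.0, 4.0]              # stands in for the 100-entry data set
errors_200 = fill_with_statistic(noise, 200)  # target with 200 rows
errors_50 = fill_with_statistic(noise, 50)    # target with 50 rows
```

Here the error column has 200 or 50 entries regardless of how many values the source column holds.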
Comment 1 Alexander Semke 2023-09-24 19:24:36 UTC
(In reply to pcfreak115 from comment #0)
> STEPS TO REPRODUCE
> 1. Create column with numbers in it
> 2. Create second column in another table
> 3. For this second column, use Generate Data and then the function stdev(x),
> where for x the column from step 1 is used
> 
> OBSERVED RESULT
> The second column contains exactly as many entries as there are in the first
> column
> [...]
Right now we adjust the size of the target column/spreadsheet, which makes sense in very many cases. But your points are valid, of course. In case spreadsheets with different sizes are involved, we should warn the user and also let them decide whether the sizes should be adjusted or not. We'll implement this.

May I ask you about your scenario? Why do you need to have the standard deviation 50 times?
Comment 2 pcfreak115 2023-09-24 21:15:05 UTC
(In reply to Alexander Semke from comment #1)
> (In reply to pcfreak115 from comment #0)
> > STEPS TO REPRODUCE
> > 1. Create column with numbers in it
> > 2. Create second column in another table
> > 3. For this second column, use Generate Data and then the function stdev(x),
> > where for x the column from step 1 is used
> > 
> > OBSERVED RESULT
> > The second column contains exactly as many entries as there are in the first
> > column
> > [...]
> Right now we adjust the size of the target column/spreadsheet which makes
> sense in very many cases. But your points are valid, of course. In case
> spreadsheets with different sizes are involved, we should warn the user and
> also allow to decide whether the sizes should be adjusted or not. We'll
> implement this.
> 
> May I ask you about your scenario? Why do you need to have the standard
> deviation 50 times?

In my scenario I recorded some background noise with my experiment and put all the values into one column. Then I did the actual measurement but with a smaller number of samples. I then wanted to use the standard deviation of the background noise as error for my actual measurements. And as far as I know, I have to fill another column parallel to the measurements with the errors so that in a plot all my measurements get their proper error bar.

This is, of course, a relatively simple case where I could also just copy the value of the standard deviation and generate constant values in the error column. But I believe in more complicated cases this might also become a lot more inconvenient. (Think automated templates which can be reused if an experiment is repeated without having to manually copy some specific value, etc...) 

Also, for the usual functions (i.e. Number -> Number, just applied to all entries, like sqrt(x)) the resizing behavior, i.e. that the number of entries in the target column is set by the source column, makes some sense to me.
But in this case stdev is a function Column -> Number, where I believe the number of entries in the target column should be set by the target. Composite/chained functions like stdev(x)*sqrt(x)*y*stdev(x*y), where x and y correspond to differently sized columns/tables, might make this more complicated, though. (Not something I actually need, just another example. Speaking of which, I haven't checked if something like stdev(x*y) is even possible right now?***) In this example I would expect the target size to be expanded to fit sqrt(x)*y; the standard deviations should just act as scaling factors for the entire target column.


*** stdev(x*y) doesn't make sense for any particular values of x and y, because this would just be 0, but rather for x and y as references to columns. While writing this I checked, and I get the first behavior, i.e. stdev(x*y) = 0, which is kinda inconsistent behavior? stdev(x) is the standard deviation of the entire column referenced by x, while stdev(x*y) is the standard deviation of the single value x*y. However, stdev(x*y) could be rephrased with another column z which is generated by a function z=x*y, and then stdev(z) does not give the same result as stdev(x*y). I believe this could be another bug, which is probably common to all functions Column -> Number.
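
The two readings of stdev(x*y) can be compared in a short Python sketch. This is not LabPlot code and says nothing about how LabPlot evaluates the expression internally; the sample values are made up, and statistics.pstdev is used here only because it returns 0 for a single value instead of raising an error.

```python
from statistics import pstdev

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

# Column reading: build z = x*y row by row, then take one statistic of z.
z = [a * b for a, b in zip(x, y)]
column_result = pstdev(z)

# Per-value reading: the statistic only ever sees one product at a time,
# so the spread is zero in every row.
per_value_result = [pstdev([a * b]) for a, b in zip(x, y)]
```

With these values column_result is positive while every entry of per_value_result is 0, which is the mismatch between stdev(z) and stdev(x*y) described above.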

It seems like functions which are not "bijective" with regard to the index in the source and target columns make the generate function system a lot more complicated... I have some more thoughts on this, but they are kinda half-baked so I don't want to share them right now.

Sorry for the wall of text. I hope it makes sense, at least.
Comment 3 pcfreak115 2023-09-24 22:07:21 UTC
Created attachment 161846 [details]
The case in which the table is unnecessarily expanded

The case in which the table is unnecessarily expanded. (I suppose normal behavior for something like sqrt(x), but annoying for stdev(x))
Comment 4 pcfreak115 2023-09-24 22:08:33 UTC
Created attachment 161847 [details]
The case in which the table has not enough error values.

The case in which the table has not enough error values. (I suppose normal behavior for something like sqrt(x), but very problematic in this case/for stdev(x))
Comment 5 Alexander Semke 2023-09-27 06:51:21 UTC
(In reply to pcfreak115 from comment #3)
> Created attachment 161846 [details]
> The case in which the table is unnecessarily expanded
> 
> The case in which the table is unnecessarily expanded. (I suppose normal
> behavior for something like sqrt(x), but annoying for stdev(x))

(In reply to pcfreak115 from comment #2)
> (In reply to Alexander Semke from comment #1)
> > (In reply to pcfreak115 from comment #0)
> > > STEPS TO REPRODUCE
> > > 1. Create column with numbers in it
> > > 2. Create second column in another table
> > > 3. For this second column, use Generate Data and then the function stdev(x),
> > > where for x the column from step 1 is used
> > > 
> > > OBSERVED RESULT
> > > The second column contains exactly as many entries as there are in the first
> > > column
> > > [...]
> > Right now we adjust the size of the target column/spreadsheet which makes
> > sense in very many cases. But your points are valid, of course. In case
> > spreadsheets with different sizes are involved, we should warn the user and
> > also allow to decide whether the sizes should be adjusted or not. We'll
> > implement this.
> > 
> > May I ask you about your scenario? Why do you need to have the standard
> > deviation 50 times?
> 
> In my scenario I recorded some background noise with my experiment and put
> all the values into one column. Then I did the actual measurement but with a
> smaller number of samples. I then wanted to use the standard deviation of
> the background noise as error for my actual measurements. And as far as I
> know, i have to fill another column parallel to the measurements with the
> errors so in a plot all my measurements get their proper error bar. 
Ok. This is a valid scenario. We're going to add a new option "Auto resize the column" to the dialog (with more explanation in its tooltip text) so the user can decide case by case. See the screenshot attached.


> This is, of course, a relatively simple case where I could also just copy
> the value of the standard deviation and generate constant values in the
> error column. But I believe in more complicated cases this might also become
> a lot more inconvenient. (Think automated templates which can be reused if
> an experiment is repeated without having to manually copy some specific
> value, etc...) 
Since you mentioned templates: we already have templates for plots, and we plan to extend this concept and also implement templates for projects. This might come in handy in situations like the one you described, with repeated experiments, etc.
Comment 6 Alexander Semke 2023-09-27 06:52:12 UTC
Created attachment 161905 [details]
new option to auto resize the target column
Comment 7 Martin 2023-10-31 07:20:36 UTC
This problem was fixed in:
https://invent.kde.org/education/labplot/-/merge_requests/373

Can you test with the latest development release if it fits your needs?
Comment 8 pcfreak115 2023-10-31 08:58:37 UTC
(In reply to Martin from comment #7)
> This problem was fixed in:
> https://invent.kde.org/education/labplot/-/merge_requests/373
> 
> Can you test with the latest development release if it fits your needs?

This works, thanks! 
However, if the target column is smaller than my source column, the target column is still unnecessarily expanded.
Also, if I use the automatic recalculation feature and then manually expand my target, the new cells are left empty. Here I would expect the automatic recalculation to also fill these new cells.
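
A rough Python sketch of the recalculation behavior I would expect. The class AutoUpdatedColumn and its resize method are hypothetical and not part of LabPlot; statistics.stdev is assumed to stand in for LabPlot's stdev.

```python
from statistics import stdev


class AutoUpdatedColumn:
    """Hypothetical target column that re-evaluates its formula on resize."""

    def __init__(self, source, rows):
        self.source = source
        self.values = []
        self.resize(rows)

    def resize(self, rows):
        # Recompute for all rows so that newly added cells are never empty.
        value = stdev(self.source)
        self.values = [value] * rows


column = AutoUpdatedColumn([1.0, 2.0, 3.0], rows=2)
column.resize(5)  # manual expansion of the target
```

After the resize, all five cells hold the standard deviation, including the three new ones.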
Comment 9 Martin 2023-11-09 20:50:08 UTC
Which system are you working on?

Because I found a bug when trying to use stdev(x). Are you able to generate data?

I will fix the other issue as well
Comment 10 pcfreak115 2023-11-13 21:03:07 UTC
(In reply to Martin from comment #9)
> Which system are you working on?
> 
> Because I found a bug when trying to use stdev(x). Are you able to generate
> data?
> 
> I will fix the other issue as well

I am working on a Manjaro-Gnome system. As far as I can tell, stdev(x) works fine for me.
Comment 11 Martin 2023-11-13 21:24:08 UTC
Now everything is merged. You can test it now; the resize was implemented recently.
Comment 12 pcfreak115 2023-11-21 15:42:45 UTC
(In reply to Martin from comment #11)
> Now everything is merged. You can test it now; the resize was implemented
> recently.

It seems like when increasing the number of rows of a table using the arrow buttons, the last cell stays empty and is not filled (tested with stdev).

Just entering a number in the row field does extend the table but does not recalculate the missing values at all.
Comment 13 Martin 2023-11-21 18:10:56 UTC
Can you have a look at the description of this MR: https://invent.kde.org/education/labplot/-/merge_requests/414
I think it solves that problem as well.