Bug 312646 - Filesize rounding and significant digits...
Summary: Filesize rounding and significant digits...
Status: REPORTED
Alias: None
Product: kdelibs
Classification: Frameworks and Libraries
Component: klocale
Version: 4.9.5
Platform: Ubuntu Linux
Importance: NOR wishlist
Target Milestone: ---
Assignee: Chusslove Illich
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-01-05 02:55 UTC by Bzzz
Modified: 2013-01-30 00:08 UTC
CC List: 3 users

See Also:
Latest Commit:
Version Fixed In:


Attachments
conversion demo script (1.63 KB, text/x-c++src)
2013-01-30 00:08 UTC, Bzzz

Description Bzzz 2013-01-05 02:55:48 UTC
Hi,
I'm a little irritated by how Dolphin handles rounding precision when it comes to displaying the size of files. Command line tools such as ls (-lh) and df (-h) already use strange rules, but these are meant to use as little space for the filesize column as possible. Dolphin apparently uses an algorithm that is worse. Let me explain, in case you have never heard about the concept of significant digits:

If you do a measurement in real life, you will end up with a value and, if you have had some higher education, a level of uncertainty of that value. For example, you may gauge a small distance with a carpenter's rule and a micrometer (the tool, not the unit). The results are 3.6 cm and 3.600 cm. Is that the same? No! The first one has two significant digits and includes a large error of let's say 0.2 cm, so the real distance may be something between 3.4 and 3.8 cm. The second one has four significant digits and the tool only yields an error of 0.002 cm, so the measured distance is something between 3.598 and 3.602 cm. So the latter is a far more precise piece of information about the thing that was measured. Sadly, any tool has its limitations, so the rule has a large span but little accuracy, far too bad for any watchmaker, while the micrometer has good accuracy but a very limited range, far too small for any carpenter. A scanning electron microscope may resolve single atoms, but cannot measure the distance to the moon, while parallax measurement can do that, but only with a horribly large minimum error.

Fortunately, in the computer world there is a smallest possible amount of data, and that is one bit. Furthermore, we are almost always capable of telling the exact size of a chunk of data, meaning atomic resolution or as many significant digits as we need to tell the exact value. So if we have exactly 1 MiB (±0), we are not wrong in translating that to 1.000 MiB (I stay with binary prefixes like dolphin does), 1024 KiB, 1024.00 KiB or 1048576 Bytes.

Now, what does that have to do with Dolphin? Set up an empty file and grow it. It starts counting 1 B, 2 B, ... 100 B, ... 999 B, 1 000 B, ... 1 023 B. We have four significant digits at that point.
Give it a single byte more. Dolphin now states 1.0 KiB. That's only two digits. We've lost two orders of magnitude (~99%) of our accuracy because the file is now 1/1023rd (~0.098%) bigger than before. That's like counting up to a meter in steps of millimeters (1 m = 1000 mm), and beyond that in decimeters (1 m = 10 dm). No one would want to lose that much precision at once.
Go further. Dolphin displays 1.1 KiB, 1.2 KiB, ... 100.0 KiB, ... 1 023.9 KiB. That's now five significant digits. Give it one byte more -> 1.0 MiB. Again, that's like counting up to a meter, this time in tenths of a millimeter (!), and after that point in decimeters. A horrible loss in precision.
Same for every upcoming prefix change, like MiB->GiB, GiB->TiB and so on.
Or have a look at what these figures mean. 1023.9 KiB states that, using common rounding rules, the file is not 1023.8 KiB and not 1024.0 KiB in size. So something between 1023.85 KiB and 1023.94999.. KiB, translating to 1,048,422 to 1,048,525 Bytes, a range (or potential "error") of just 103 Bytes. A quite narrow bin, only 103 different-sized files fit in until one size has to be there twice.
1.0 MiB however says that the file is larger than or equal to 0.95 MiB (otherwise it would be 0.9 MiB or less), and smaller than 1.04999... MiB. This translates to 996,147 to 1,101,005 Bytes, a range of 104,858 Bytes. Again, we have atomic precision, so we KNOW the exact size, but we don't DISPLAY it!
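
The two bins compared above can be checked with a short sketch. It assumes round-half-up at the displayed decimal place; `ByteRange` and `bytesShownAs` are hypothetical names for illustration, not kdelibs API. Because it counts whole bytes only, its bounds land within a byte or two of the rounded figures quoted above.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper (not part of kdelibs): the whole-byte sizes that
// would all be displayed as the same rounded value.
struct ByteRange {
    std::int64_t lowest;   // smallest byte count shown as `displayed`
    std::int64_t highest;  // largest byte count shown as `displayed`
};

// displayed: the number as shown (e.g. 1023.9)
// unitBytes: bytes per displayed unit (1024 for KiB, 1048576 for MiB)
// decimals:  decimal places shown
// Assumes round-half-up, i.e. the bin is
// [displayed - halfStep, displayed + halfStep).
ByteRange bytesShownAs(double displayed, double unitBytes, int decimals = 1)
{
    const double halfStep = 0.5 * std::pow(10.0, -decimals);
    const double lo = (displayed - halfStep) * unitBytes;
    const double hi = (displayed + halfStep) * unitBytes;
    ByteRange r;
    r.lowest = static_cast<std::int64_t>(std::ceil(lo));
    r.highest = static_cast<std::int64_t>(std::ceil(hi)) - 1;  // hi itself is excluded
    return r;
}
```

Calling `bytesShownAs(1023.9, 1024.0)` gives a bin of about a hundred byte sizes, while `bytesShownAs(1.0, 1048576.0)` gives one of about a hundred thousand, which is the jump in coarseness described above.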

Why is that the case? I understand that displaying 1.673742799 GiB spoils the fun of having prefixes that can limit the needed column width very effectively. This level of accuracy isn't needed for files that large, and as we have an elementary unit of one Bit (or Byte), this level doesn't even make sense for smaller files. I also accept that having one decimal place for any number looks nice and well-arranged. But even this is flawed: Files below 1024 Bytes do not have that decimal place, and as most of us use fonts with a non-fixed width, filesizes in GiB do not appear exactly below files that are measured in MiB, because the G uses more space than the M, and therefore the figure is slightly displaced. The displacement however is not large enough (except for Byte-sized files, which lack two characters in the unit descriptor) that one may tell at first glance that files differ thousandfold or more in size.

I don't know if there has been a discussion about this, probably very few people care about the number format in a file browser. But no real scientist can afford to be lazy about significant digits! ;) I hereby propose three digits for any size indication, maybe user-tweakable to four. Files below 100 Bytes do not gain decimal places or leading zeroes. I'll use that format for the examples that will appear below.

Reproducible: Always

Steps to Reproduce:
1. open dolphin
2. browse to a folder with many different-sized files (or create some yourself)
3. look at the strange format in the "size" column
Actual Results:  
3 B
973 B
1 023 B
1.9 KiB
19.9 KiB
200.0 KiB
1 023.0 KiB
1.0 MiB
3.0 GiB
1 003.4 GiB
7.4 TiB
et cetera...

Expected Results:  
3 B
973 B
1.00 KiB
1.94 KiB
19.9 KiB
200 KiB
1.00  MiB
1.03  MiB
2.99   GiB
0.98    TiB
7.40    TiB
Note that the column text in Dolphin is right-aligned, which leads to aligned "B"s, and a larger indent of the numbers the larger the prefix is.
Note also that for Byte and Kibibyte, the numeral is separated from the unit by a single space. For any higher prefix, another space is added per 1024x increase (MiB: two, GiB: three, TiB: four spaces, ...), indicating a larger magnitude of size without wasting much width. Color-coding or slight variations in font size or boldness may also be a viable way to emphasize the unit changes.
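
The expected results above can be approximated with a minimal sketch in standard C++. `formatThreeDigits` is a hypothetical name, not a kdelibs function. The thresholds 999.5, 99.95 and 9.995 are assumptions chosen so that the rounded output keeps three digits, and the locale-specific decimal separator is ignored.

```cpp
#include <cstdio>
#include <string>

// Sketch only: formatThreeDigits() is a hypothetical name, not kdelibs API.
// Shows three digits for every unit above bytes and pads the unit with one
// extra space per 1024x step from MiB upwards, as proposed in the report.
std::string formatThreeDigits(double bytes)
{
    static const char *const units[] = {"B", "KiB", "MiB", "GiB", "TiB", "PiB"};
    const int maxMagnitude = 5;  // assumption: stop at PiB in this sketch

    double value = bytes;
    int magnitude = 0;
    // Switch units as soon as the value would be displayed with a fourth
    // digit before the decimal point.
    while (value >= 999.5 && magnitude < maxMagnitude) {
        value /= 1024.0;
        ++magnitude;
    }

    // Bytes never get decimals; other units get 2, 1 or 0 decimals.
    int decimals = 0;
    if (magnitude > 0) {
        if (value < 9.995) {
            decimals = 2;
        } else if (value < 99.95) {
            decimals = 1;
        }
    }

    // One space for B and KiB, two for MiB, three for GiB, four for TiB, ...
    const int spaces = magnitude < 2 ? 1 : magnitude;

    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f%*s%s",
                  decimals, value, spaces, "", units[magnitude]);
    return buf;
}
```

With this sketch, 1023 bytes comes out as `1.00 KiB` and 1003.4 GiB as `0.98    TiB`, in line with the list above. Values that land exactly on one of the thresholds may still round differently than intended.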

Yes, I know there are more important things to fix! ;)
Comment 1 Frank Reininghaus 2013-01-05 14:18:04 UTC
Thanks for the detailed report. This isn't a Dolphin issue, the formatting of file sizes is done by KLocale::formatByteSize():

http://api.kde.org/4.x-api/kdelibs-apidocs/kdecore/html/classKLocale.html#a3116ebc59ad98cbf80dc06ff9489e330

> Yes, I know there are more important things to fix! ;)

Yes indeed ;-) Nobody else ever complained about this AFAIK, so I think the only way to change the behaviour in the near future might be that you dig into the code and submit a kdelibs patch to ReviewBoard.
Comment 2 Chusslove Illich 2013-01-05 14:57:18 UTC
Interestingly, I too have been annoyed by this behavior and for exactly the
reason Bzzz states, but never sufficiently annoyed to start a discussion
about it. (Maybe because when I really need to make size-based decisions, I
usually revert to du and explicit unit selection...). I can only say I'd be
all in favor of the change, and I'm adding current maintainer of KLocale to
CC list.
Comment 3 Christoph Feck 2013-01-06 01:51:16 UTC
I really wished there was an option to "spoil the fun" and just display the exact number of bytes, thousands separated by spaces, so that you can easily see the magnitude.
Comment 4 John Layt 2013-01-14 14:44:01 UTC
It looks like Dolphin uses the default formatByteSize() method, which defaults to 1 decimal place and the default algorithm described below, rather than the newer API, which allows you to set the required significant digits and units to use (http://api.kde.org/4.x-api/kdelibs-apidocs/kdecore/html/classKLocale.html#a566a3d868abfca050e1191b16e85ed73).  Perhaps a quick initial fix would be for Dolphin to request 2 decimal places?

Here's a simplified version of the current code:

QString KLocalePrivate::formatByteSize(double bytes, int precision, BinaryUnitDialect dialect, BinarySizeUnits units)
{
    double value = bytes;
    int magnitude = 0;
    double multiplier = 1024.0;
    QString numString;
    // Note: maxMagnitude is not declared in this simplified excerpt.

    if (dialect == MetricBinaryDialect) {
        multiplier = 1000.0;
    }

    if (units == DefaultBinaryUnits) {
        while (value >= multiplier && magnitude < maxMagnitude) {
            value = value / multiplier;
            magnitude = magnitude + 1;
        }
    } else {
        magnitude = units;
        if (magnitude > 0) {
            value = value / pow(multiplier, magnitude);
        }
    }

    if (magnitude == 0) {
        // Bytes, no rounding
        numString = formatNumber(value, 0);
    } else {
        numString = formatNumber(value, precision);
    }

    return translateUnits(dialect, magnitude, numString);
}

There are two key points to the algorithm:
* If the value stays in the base unit (i.e. bytes < multiplier), then the precision is ignored, i.e. no decimal places
* It moves up to the next magnitude only once the value reaches the multiplier (1024 or 1000), and not before
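
To try the two points above without Qt, here is a standalone model of the simplified algorithm. It covers only the binary dialect with default units. `formatByteSizeModel`, the unit table, and the `maxMagnitude` value are assumptions of this sketch, and locale-specific digit grouping (the space in "1 023") is left out.

```cpp
#include <cstdio>
#include <string>

// Standalone model of the simplified algorithm (binary dialect, default
// units). Not the kdelibs code: no Qt, no locale handling.
std::string formatByteSizeModel(double bytes, int precision = 1)
{
    static const char *const units[] = {"B", "KiB", "MiB", "GiB", "TiB"};
    const int maxMagnitude = 4;  // assumption for this sketch

    double value = bytes;
    int magnitude = 0;
    // Key point 2: only move up once the value reaches 1024.
    while (value >= 1024.0 && magnitude < maxMagnitude) {
        value /= 1024.0;
        ++magnitude;
    }

    // Key point 1: plain bytes ignore the requested precision.
    const int decimals = (magnitude == 0) ? 0 : precision;

    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f %s", decimals, value, units[magnitude]);
    return buf;
}
```

In this model, 1047552 bytes formats as `1023.0 KiB` and one KiB more as `1.0 MiB`, which is the drop from five digits to two that the report describes. Passing `precision = 2` shows what the suggested quick fix would change.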

The question is, under what conditions do you decide to go to the next magnitude of units, 1% before, 5% before?  And does having a decimal place on the bytes make any logical sense when it will always be .0?  What works in some uses will not be suitable for others, i.e. a single value in the middle of a sentence versus a list view in Dolphin, and the default has to try to cater for the most common use case, leaving special cases to write their own code using the advanced API.  I am also reluctant to change the current default behaviour as many other apps could be depending on the way it currently works.  You also have to remember that Dolphin is designed as a simpler style of file manager for 90% of use cases, any solution would need to target most users, and in most cases a rough idea of file size is all that is needed.

I'm a little unclear on the exact behaviour you'd prefer. Could you write a simple algorithm in words for how you think it should behave?  Perhaps Dolphin could then implement the algorithm itself for its particular use case, or we could extend the API to allow the behaviour to be controlled?
Comment 5 Bzzz 2013-01-14 15:15:56 UTC
(In reply to comment #4)

> The question is, under what conditions do you decide to go to the next
> magnitude of units, 1% before, 5% before?  
As soon as it reaches the fourth place before the decimal point/comma. Because we use 2^10 counts until we switch units, that cannot be a fixed percentage (as percentage uses a system based on 10^x). I guess that has to be 1 - (1000/1024)^x, with x being the magnitude of the larger unit, e.g. 1 for kibibytes, 2 for mebibytes, and so on. That calculates the relative difference between a binary unit and its decimal counterpart.
For example, 1 gibibyte has 1,073,741,824 bytes. The condition needs to be in effect at 1,000,000,000 bytes, while 999,999,999 is still okay. (1000/1024)^3 equals 0.93132..; multiplied by 1 GiB, that is exactly 1,000,000,000 bytes or 1 GB. So in this case, it would be 1 - 0.93132.. = 6.868..% before the current switching point. But I think it is easier to implement something that just uses the next magnitude when reaching a certain number of digits, because all of these changes happen at 1+3y digits (y=1 for 1000, y=2 for 1000000, and so on).
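
The arithmetic in this reply can be checked with a few lines. Note that (1000/1024)^3 ≈ 0.93132 is the ratio itself, and one minus that ratio gives the gap of roughly 6.87%. `decimalShortfall` is a hypothetical helper name, and x is taken here as 1 for KiB, 2 for MiB, 3 for GiB, to match the gibibyte example.

```cpp
#include <cmath>

// Fraction by which the decimal unit (1000^x bytes) falls short of the
// matching binary unit (1024^x bytes): 1 - (1000/1024)^x.
// x = 1 for KiB vs kB, 2 for MiB vs MB, 3 for GiB vs GB, ...
double decimalShortfall(int x)
{
    return 1.0 - std::pow(1000.0 / 1024.0, x);
}
```

For x = 1 to 4 this gives about 2.34%, 4.63%, 6.87% and 9.05%, so the gap grows with every unit step, which is why a fixed percentage cannot serve as the switching rule.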

> And does having a decimal place
> on the bytes make any logical sense when it will always be .0?  
For quantities of just bytes, that is below 1024 bytes, the .0 doesn't make sense (I think I said that), as there is no fraction of a byte. For any larger numbers, the decimal place does make sense.

> I'm a little unclear on the exact behaviour you'd prefer, if you could write
> a simple algorithm in words for how you think it should behave?
I'll try that asap

> Perhaps
> Dolphin could then implement the algorithm itself for its particular use
> case, or we could extend the api to allow the behaviour to be controlled?
As you said, other programs might depend on the current implementation. Therefore an API extension would be the best way to deal with that, without breaking other code.
Comment 6 Bzzz 2013-01-30 00:08:23 UTC
Created attachment 76797
conversion demo script

Guess I was wrong in determining the point where to switch units. Of course one can use for example 999 (instead of 931) GiB before the unit has to be TiB. It's just that the TiB numbering won't start at 1.00, but rather at 0.9somewhat TiB.

Well, the first code proposal is the attached one. Note that it doesn't really make use of the const maxdigits, so atm output is fixed to three digits (I know that 0.9x to 0.99 does not have three significant figures like any other number above 1 and below 999, but I think that is a reasonable compromise to keep things simple). Implementing this would allow for two additional fine-tuning options:
* granularity: numbers like 0.9939 TiB (more digits) or 0.1 GiB, 1.0 GiB, 10 GiB, 0.1 TiB (fewer digits)
* Like Christoph Feck proposed, larger amounts before switching units, like 9999 MiB -> MiB overflow -> 9.766 GiB, 99.99 GiB, 999.9 GiB, 9999 GiB -> GiB overflow ->, including the extreme setting of never even leaving the Byte unit and therefore displaying any given filesize in bytes.
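
The two tuning options can be illustrated with a small sketch. This is not the attached script, which is not reproduced in this thread. `formatTunable`, `sigDigits` and `maxIntDigits` are hypothetical names, and rounding right at a unit boundary is not handled specially.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

// Hypothetical sketch, not the attached demo script.
// sigDigits:    how many digits to show in total (granularity option)
// maxIntDigits: how many digits are allowed before the decimal point
//               before the next unit is used (overflow option)
std::string formatTunable(double bytes, int sigDigits = 3, int maxIntDigits = 3)
{
    static const char *const units[] = {"B", "KiB", "MiB", "GiB", "TiB"};
    const int maxMagnitude = 4;  // assumption for this sketch

    // Switching limit: 10^maxIntDigits, e.g. 1000 or 10000.
    double limit = 1.0;
    for (int i = 0; i < maxIntDigits; ++i) {
        limit *= 10.0;
    }

    double value = bytes;
    int magnitude = 0;
    while (value >= limit && magnitude < maxMagnitude) {
        value /= 1024.0;
        ++magnitude;
    }

    // Count the digits before the decimal point (at least one).
    int intDigits = 1;
    for (double v = value; v >= 10.0; v /= 10.0) {
        ++intDigits;
    }

    // Bytes never get decimals; otherwise fill up to sigDigits.
    const int decimals = (magnitude == 0) ? 0 : std::max(0, sigDigits - intDigits);

    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f %s", decimals, value, units[magnitude]);
    return buf;
}
```

With `sigDigits = 4` and `maxIntDigits = 4`, 9999 MiB stays `9999 MiB` and 10000 MiB becomes `9.766 GiB`. A very large `maxIntDigits` never leaves the byte unit, which gives the plain byte count Christoph Feck asked for (without thousands separators in this sketch).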