1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230

.TH ENT "1" "July 2007" "ent" "http://www.fourmilab.ch/random/"
.SH NAME
\fBent\fR \ pseudorandom number sequence test
.PP
This page describes a program, \fBent\fR, which applies various tests to
sequences of bytes stored in files and reports the results of those tests.
The program is useful for those evaluating pseudorandom number generators
for encryption and statistical sampling applications, compression
algorithms, and other applications where the information density of a file
is of interest.
.SH SYNOPSIS
\fBent\fR [ \bcftu ] [ \fIinfile\fR ]
.SH DESCRIPTION
\fBent\fR performs a variety of tests on the stream of bytes in \fIinfile\fR (or
standard input if no \fIinfile\fR is specified) and produces output as follows
on the standard output stream:
.PP
.nf
Entropy = 7.980627 bits per character.
Optimum compression would reduce the size
of this 51768 character file by 0 percent.
Chi square distribution for 51768 samples is 1542.26, and randomly
would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 125.93 (127.5 = random).
Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
.fi
.PP
The values calculated are as follows:
.PP
Entropy
.PP
The information density of the contents of the file, expressed as
a number of bits per character. The results above, which resulted
from processing an image file compressed with JPEG, indicate that
the file is extremely dense in information  essentially random.
Hence, compression of the file is unlikely to reduce its size. By
contrast, the C source code of the program has entropy of about
4.9 bits per character, indicating that optimal compression of the
file would reduce its size by 38%. \fB[Hamming, pp. 104108]\fR
.PP
Chisquare Test
.PP
The chisquare test is the most commonly used test for the
randomness of data, and is extremely sensitive to errors in
pseudorandom sequence generators. The chisquare distribution is
calculated for the stream of bytes in the file and expressed as an
absolute number and a percentage which indicates how frequently a
truly random sequence would exceed the value calculated. We
interpret the percentage as the degree to which the sequence
tested is suspected of being nonrandom. If the percentage is
greater than 99% or less than 1%, the sequence is almost certainly
not random. If the percentage is between 99% and 95% or between 1%
and 5%, the sequence is suspect. Percentages between 90% and 95%
and 5% and 10% indicate the sequence is "almost suspect". Note
that our JPEG file, while very dense in information, is far from
random as revealed by the chisquare test.
.PP
Applying this test to the output of various pseudorandom sequence
generators is interesting. The loworder 8 bits returned by the
standard Unix rand() function, for example, yields:
.PP
.nf
Chi square distribution for 500000 samples is 0.01, and randomly
would exceed this value 99.99 percent of the times.
.fi
.PP
While an improved generator \fB[Park & Miller]\fR reports:
.PP
.nf
Chi square distribution for 500000 samples is 212.53, and
randomly would exceed this value 95.00 percent of the times.
.fi
.PP
Thus, the standard Unix generator (or at least the loworder bytes
it returns) is unacceptably nonrandom, while the improved
generator is much better but still sufficiently nonrandom to
cause concern for demanding applications. Contrast both of these
software generators with the chisquare result of a genuine random
sequence created by timing radioactive decay events.
.PP
.nf
Chi square distribution for 32768 samples is 237.05, and
randomly would exceed this value 75.00 percent of the times.
.fi
.PP
See \fB[Knuth, pp. 3540]\fR for more information on the chisquare
test. An interactive chisquare calculator is available at this
site.
.PP
Arithmetic Mean
.PP
This is simply the result of summing the all the bytes (bits if
the \fB\b\fR option is specified) in the file and dividing by the file
length. If the data are close to random, this should be about
127.5 (0.5 for \fB\b\fR option output). If the mean departs from this
value, the values are consistently high or low.
.PP
Monte Carlo Value for Pi
.PP
Each successive sequence of six bytes is used as 24 bit X and Y
coordinates within a square. If the distance of the
randomlygenerated point is less than the radius of a circle
inscribed within the square, the sixbyte sequence is considered a
"hit". The percentage of hits can be used to calculate the value
of Pi. For very large streams (this approximation converges very
slowly), the value will approach the correct value of Pi if the
sequence is close to random. A 32768 byte file created by
radioactive decay yielded:
.PP
.nf
Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
.fi
.PP
Serial Correlation Coefficient
.PP
This quantity measures the extent to which each byte in the file
depends upon the previous byte. For random sequences, this value
(which can be positive or negative) will, of course, be close to
zero. A nonrandom byte stream such as a C program will yield a
serial correlation coefficient on the order of 0.5. Wildly
predictable data such as uncompressed bitmaps will exhibit serial
correlation coefficients approaching 1. See \fB[Knuth, pp. 6465]\fR for
more details.
.SH OPTIONS
.IP \fB\b\fR
The input is treated as a stream of bits rather than of 8bit
bytes. Statistics reported reflect the properties of the
bitstream.
.IP \fB\c\fR
Print a table of the number of occurrences of each possible byte
(or bit, if the \fB\b\fR option is also specified) value, and the
fraction of the overall file made up by that value. Printable
characters in the ISO 88591 Latin1 character set are shown along
with their decimal byte values. In nonterse output mode, values
with zero occurrences are not printed.
.IP \fB\f\fR
Fold upper case letters to lower case before computing statistics.
Folding is done based on the ISO 88591 Latin1 character set, with
accented letters correctly processed.
.IP \fB\t\fR
Terse mode: output is written in Comma Separated Value (CSV)
format, suitable for loading into a spreadsheet and easily read by
any programming language. See Terse Mode Output Format below for
additional details.
.IP \fB\u\fR
Print howtocall information.
.SH FILES
If no \fIinfile\fR is specified, \fBent\fR obtains its input from standard input.
Output is always written to standard output.
.SH TERSE MODE OUTPUT FORMAT
Terse mode is selected by specifying the \fB\t\fR option on the command line.
Terse mode output is written in Comma Separated Value (CSV) format, which
can be directly loaded into most spreadsheet programs and is easily read
by any programming language. Each record in the CSV file begins with a
record type field, which identifies the content of the following fields.
If the \fB\c\fR option is not specified, the terse mode output will consist of
two records, as follows:
.PP
.nf
0,Filebytes,Entropy,Chisquare,Mean,MonteCarloPi,SerialCorrelation
1,file_length,entropy,chi_square,mean,Pi_value,correlation
.fi
.PP
where the italicised values in the type 1 record are the numerical values
for the quantities named in the type 0 column title record. If the \fB\b\fR
option is specified, the second field of the type 0 record will be
"Filebits", and the file_length field in type 1 record will be given in
bits instead of bytes. If the \fB\c\fR option is specified, additional records
are appended to the terse mode output which contain the character counts:
.PP
.nf
2,Value,Occurrences,Fraction
3,v,count,fraction
. . .
.fi
.PP
If the \fB\b\fR option is specified, only two type 3 records will appear for the
two bit values v=0 and v=1. Otherwise, 256 type 3 records are included,
one for each possible byte value. The second field of a type 3 record
indicates how many bytes (or bits) of value v appear in the input, and
fraction gives the decimal fraction of the file which has value v (which
is equal to the count value of this record divided by the file_length
field in the type 1 record).
.SH BUGS
Note that the "optimal compression" shown for the file is computed from
the byte or bitstream entropy and thus reflects compressibility based on
a reading frame of the chosen width (8bit bytes or individual bits if the
\fB\b\fR option is specified). Algorithms which use a larger reading frame, such
as the LempelZiv \fB[Lempel & Ziv]\fR algorithm, may achieve greater
compression if the file contains repeated sequences of multiple bytes.
.SH SEE ALSO
\fIIntroduction to Probability and Statistics\fR
.br
http://www.fourmilab.ch/rpkp/experiments/statistics.html
.PP
\fB[Hamming]\fR
.br
Hamming, Richard W. \fICoding and Information Theory.\fR Englewood
Cliffs NJ: PrenticeHall, 1980.
.PP
\fB[Knuth]\fR
.br
Knuth, Donald E. \fIThe Art of Computer Programming, Volume 2 /
Seminumerical Algorithms\fR. Reading MA: AddisonWesley, 1969. ISBN
0201896842.
.PP
\fB[Lempel & Ziv]\fR
.br
Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
Compression". \fIIEEE Transactions on Information Theory\fR \fB23\fR, 3,
pp. 337343.
.PP
\fB[Park & Miller]\fR
.br
Park, Stephen K. and Keith W. Miller. "Random Number Generators:
Good Ones Are Hard to Find". \fICommunications of the ACM\fR, October
1988, p. 1192.
.SH COPYING
This software is in the public domain. Permission to use, copy, modify,
and distribute this software and its documentation for any purpose and
without fee is hereby granted, without any conditions or restrictions.
This software is provided "as is" without express or implied warranty.
.SH AUTHOR
John Walker
.br
October 20th, 1998
