Directives in comments
The starting point of the script to generate test data is the SQL schema of the database taken from a file. It includes important information that will be used to generate data: attribute types, uniqueness, not-null-ness, foreign keys… The idea is to augment the schema with directives in comments so as to provide additional hints about data generation.
A datafiller directive is a special SQL comment recognized by the script, with a df marker at the beginning. A directive must appear after the object to which it applies: either directly after the object declaration, in which case the object is implicit, or much later, in which case the object must be explicitly referenced:
-- this directive sets the default overall size
-- df: size=10

-- this directive defines a macro named "fn"
-- df fn: words=/path/to/file-containing-words

-- this directive applies to table `Foo`
CREATE TABLE Foo( -- df: mult=10.0
  -- this directive applies to attribute `fid`
  fid SERIAL -- df: offset=1000
  -- use defined macro: stuff taken out from the list of words
  , stuff TEXT NOT NULL -- df: use=fn
);

-- ... much later
-- this directive applies to attribute `fid` in table `Foo`
-- df T=Foo A=fid: mangle
A simple example
Let us start with a simple example involving a library where readers borrow books. Our schema is defined in file
library.sql as follows:
CREATE TABLE Book(
  bid SERIAL PRIMARY KEY,
  title TEXT NOT NULL
);

CREATE TABLE Reader(
  rid SERIAL PRIMARY KEY,
  firstname TEXT NOT NULL,
  lastname TEXT NOT NULL,
  born DATE NOT NULL,
  gender BOOLEAN NOT NULL,
  phone TEXT -- nullable, maybe no phone
);

CREATE TABLE Borrow(
  borrowed TIMESTAMP NOT NULL,
  rid INTEGER NOT NULL REFERENCES Reader,
  bid INTEGER NOT NULL REFERENCES Book,
  PRIMARY KEY(bid) -- a book is borrowed at most once at a time!
);
The first and only information we really need to provide is the relative or absolute size of relations. For scaling easily, the best way is to specify a relative size multiplier with the
mult directive on each table, which will be multiplied by the
size option to compute the actual size of data to generate in each table. Let us say we want 100 books in stock per reader, with 1.5 borrowed books per reader on average:
CREATE TABLE Book( -- df: mult=100.0
...
CREATE TABLE Borrow( -- df: mult=1.5
The default multiplier is 1.0, so it does not need to be set on
Reader. Then we can generate a data set with:
datafiller.py --size=1000 library.sql > library_test_data.sql
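The size arithmetic is simple: each table gets the global size times its mult. The following sketch illustrates the computation (an illustration only, not datafiller's actual code; the function name is hypothetical):

```python
# Sketch of how per-table row counts derive from --size and mult.
def table_sizes(size, multipliers):
    """Multiply the global --size by each table's mult directive."""
    return {table: int(size * mult) for table, mult in multipliers.items()}

# mult values from the schema: Book 100.0, Borrow 1.5, Reader default 1.0
sizes = table_sizes(1000, {"Book": 100.0, "Reader": 1.0, "Borrow": 1.5})
print(sizes)  # {'Book': 100000, 'Reader': 1000, 'Borrow': 1500}
```

So with --size=1000 we get 1000 readers, 100000 books and 1500 borrowings.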
Note that explicit constraints are enforced on the generated data, so that foreign keys in Borrow reference existing Books and Readers. However, the script cannot guess implicit constraints: if an attribute is not declared NOT NULL, some NULL values will be generated; if an attribute is not declared unique, the generated values will probably not be unique.
Improving generated values
If we look at the above generated data, some attributes may not reflect the reality one would expect from our library. Changing the default with per-attribute directives will help improve this first result.
First, book titles are all quite short, looking like title_number, including some collisions. Indeed the default is to generate strings with a prefix based on the attribute name and a number drawn uniformly from the expected size of the relation. We can change this to texts composed of 1 to 7 English words taken from a dictionary:
title TEXT NOT NULL -- df: text word=/etc/dictionaries-common/words length=4 lenvar=3
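The length=4 lenvar=3 pair means the word count is drawn uniformly in [length-lenvar, length+lenvar], i.e. 1 to 7 words here. A minimal stand-in for such a generator (a simplified sketch, not datafiller's implementation, with an inline word list instead of the system dictionary):

```python
import random

def random_title(words, length=4, lenvar=3, rng=random):
    """Join between length-lenvar and length+lenvar words from a list."""
    n = rng.randint(length - lenvar, length + lenvar)  # uniform in [1, 7]
    return " ".join(rng.choice(words) for _ in range(n))

words = ["quantum", "library", "paper", "ink", "shelf", "reader"]
print(random_title(words, rng=random.Random(42)))
```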
If we now look at the generated readers, the result can also be improved. First, we can decide to keep the prefix and number form, but make the statistics more in line with what one expects. Let us draw from 1000 firstnames, most frequent 3%, and 10000 lastnames, most frequent 1%:
firstname TEXT NOT NULL, -- df: prefix=fn size=1000 gen=power rate=0.03
lastname TEXT NOT NULL,  -- df: prefix=ln size=10000 gen=power rate=0.01
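With gen=power, the most frequent value is drawn with probability about rate. One simple way to obtain such a skew is inverse-transform sampling with an exponent derived from rate; this is a plausible reconstruction of the idea, not necessarily datafiller's exact formula:

```python
import math
import random

def power_draw(size, rate, rng):
    """Draw an integer in [1, size], skewed toward small values so that
    P(draw == 1) = (1/size)**(1/k) = rate, hence k = ln(size)/ln(1/rate)."""
    k = math.log(size) / math.log(1.0 / rate)
    u = rng.random()  # uniform in [0, 1)
    return int(size * u ** k) + 1

rng = random.Random(0)
draws = [power_draw(1000, 0.03, rng) for _ in range(10000)]
print(round(draws.count(1) / len(draws), 3))  # close to 0.03
```

This matches the query result further down, where the most frequent firstname appears about 3% of the time.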
The default generated dates are a few days around now, which does not make much sense for our readers’ birth dates. Let us set a range for these dates:
born DATE NOT NULL, -- df: start=1923-01-01 end=2010-01-01
Most readers from our library are female: we can adjust the rate so that 25% of readers are male, instead of the default 50%.
gender BOOLEAN NOT NULL, -- df: rate=0.25
Phone numbers also have a prefix_number structure, which does not really look like a phone number. Let us draw a string of 10 digits, and adjust the nullable rate so that 1% of phone numbers are not known. We also set the size manually to avoid too many collisions, but we could have chosen to keep them as is, as some readers do share phone numbers.
phone TEXT
  -- these directives could be on a single line
  -- df: chars='0123456789' length=10 lenvar=0
  -- df: null=0.01 size=1000000
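The chars/length pair amounts to drawing a fixed-length string over a given alphabet, with null=0.01 making 1% of the values NULL. A toy equivalent (names and structure are illustrative, not datafiller's internals):

```python
import random

def random_phone(rng, chars="0123456789", length=10, null_rate=0.01):
    """Return None about 1% of the time, else a 10-digit string."""
    if rng.random() < null_rate:
        return None
    return "".join(rng.choice(chars) for _ in range(length))

rng = random.Random(1)
phones = [random_phone(rng) for _ in range(1000)]
print(next(p for p in phones if p is not None))  # a 10-digit string
```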
The last table is about borrowed books. The timestamps are around now, we are going to spread them on a period of 50 days, that is 24 * 60 * 50 = 72000 minutes, with 60 seconds precision.
borrowed TIMESTAMP NOT NULL -- df: size=72000 prec=60
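Here size=72000 with prec=60 means 72000 distinct timestamps spaced 60 seconds apart, which spans exactly 50 days. A sketch of the equivalent computation, under the assumed semantics that one of the size steps is picked uniformly back from a reference time:

```python
import datetime
import random

STEPS = 24 * 60 * 50  # 72000 one-minute steps over 50 days
assert STEPS == 72000

def random_borrowed(now, rng, size=STEPS, prec=60):
    """Pick one of `size` timestamps spaced `prec` seconds apart."""
    step = rng.randrange(size)
    return now - datetime.timedelta(seconds=step * prec)

rng = random.Random(7)
now = datetime.datetime(2014, 1, 1, 12, 0, 0)
ts = random_borrowed(now, rng)
print(ts)  # somewhere within the 50 days before `now`
```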
Because of the unique constraint, the borrowed books are by default the first ones. Let us mangle the result so that the book numbers are scattered.
bid INTEGER NOT NULL REFERENCES Book -- df: mangle
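Mangling keeps every value inside [1, size] but scatters the order. One way to get such a bijection is an affine map modulo the size, with a multiplier coprime to it; this is a sketch of the idea with arbitrary constants, not datafiller's actual ones:

```python
import math

def mangle(v, size, a=97, b=12345):
    """Bijective affine scatter of v in [1, size]; needs gcd(a, size) == 1."""
    assert math.gcd(a, size) == 1
    return (a * (v - 1) + b) % size + 1

size = 10
scattered = [mangle(v, size) for v in range(1, size + 1)]
print(scattered)          # the same values, in scattered order
print(sorted(scattered))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because the map is a permutation, foreign keys still reference existing rows; only the pattern of which books happen to be borrowed changes.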
Now we can generate improved data for our one thousand reader library, and load it directly into the library database:
datafiller.py --size=1000 --filter library.sql | psql library
Our test database is ready. If we want more readers and books, we only need to adjust the size option. Let us query our test data:
-- show firstname distribution
SELECT firstname, COUNT(*) AS cnt FROM Reader
GROUP BY firstname ORDER BY cnt DESC LIMIT 3;
  -- fn_1_... | 33
  -- fn_2_... | 15
  -- fn_3_... | 12

-- compute gender rate
SELECT AVG(CASE WHEN gender THEN 1.0 ELSE 0.0 END) FROM Reader;
  -- 0.246000
We could go on improving the generated data so that it is more realistic. For instance, we could skew the borrowed timestamp so that there are fewer old borrowings, or skew the book number so that old books (lower numbers) are less often borrowed, or choose firstnames and lastnames from actual lists. For very application-specific constraints that would not fit the generators, it is also possible to apply updates to modify the generated data afterwards.
When to stop improving is not obvious. On the one hand, real data may show particular distributions that impact application behavior and performance, so it may be important to reflect them in the test data. On the other hand, if nothing is really done about readers, then maybe the only relevant information is the average length of firstnames and lastnames, because of the storage implications, and that’s it…
There are many more directives to drive data generation; see the online or embedded documentation (datafiller.py --man), including the comics didactic example.
See the previous post for another example based on