Directives in comments
The starting point of the script to generate test data is the SQL schema of the database taken from a file. It includes important information that will be used to generate data: attribute types, uniqueness, not-null-ness, foreign keys… The idea is to augment the schema with directives in comments so as to provide additional hints about data generation.
A datafiller directive is a special SQL comment recognized by the script, with a
df marker at the beginning. A directive must appear after the object about which it is applied, either directly after the object declaration, in which case the object is implicit, or much later, in which case the object must be explicitely referenced:
A simple example
Let us start with a simple example involving a library where readers borrow books. Our schema is defined in file
library.sql as follows:
The first and only information we really need to provide is the relative or absolute size of relations. For scaling easily, the best way is to specify a relative size multiplier with the
mult directive on each table, which will be multiplied by the
size option to compute the actual size of data to generate in each table. Let us say we want 100 books in stock per reader, with 1.5 borrowed books per reader on average:
The default multiplier is 1.0, it does not need to be set on
Reader. Then we can generate a data set with:
Note that explicit constraints are enforced on the generated data, so that foreign keys in Borrow reference existing Books and Readers. However, the script cannot guess implicit constraints, thus if an attribute is not declared
NOT NULL, then some
NULL values will be generated. If an attribute is not unique, then the generated values will probably not be unique.
Improving generated values
If we look at the above generated data, some attributes may not reflect the reality one would expect from our library. Changing the default with per-attribute directives will help improve this first result.
First, book titles are all quite short, looking like title_number, including some collisions. Indeed the default is to generate strings with a prefix based on the attribute name and a number drawn uniformly from the expected size of the relation. We can change this to texts composed of 1 to 7 English words taken from a dictionary:
If we now look at the generated readers, the result can also be improved. First, we can decide to keep the prefix and number form, but make the statistics more in line with what one expects. Let us draw from 1000 firstnames, most frequent 3%, and 10000 lastnames, most frequent 1%:
The default generated dates are a few days around now, which does not make much sense for our readers’ birth dates. Let us set a range for these dates:
Most readers from our library are female: we can adjust the rate so that 25% of readers are male, instead of the default 50%.
Phone numbers also have a prefix_number structure, which does not really look like a phone number. Let us draw a string of 10 digits, and adjust the nullable rate so that 1% of phone numbers are not known. We also set the size manually to avoid too many collisions, but we could have chosen to keep them as is, as some readers do share phone numbers.
The last table is about borrowed books. The timestamps are around now, we are going to spread them on a period of 50 days, that is 24 * 60 * 50 = 72000 minutes, with 60 seconds precision.
Because of the unique constraint, the borrowed books are by default the first ones. Let us mangle the result so that the book numbers are scattered.
Now we can generate improved data for our one thousand readers library, and fill it directly to the
Our test database is ready. If we want more users and books, we only need to adjust the
size option. Let us query on our test data:
We could go on improving the generated data so that it is more realistic. For instance, we could skew the borrowed timestamp so that there are less old borrowings, or skew the book number so that old books (lower numbers) are less often borrowed, or choose firtnames and lastnames from actual lists. For very application-specific constraints that would not fit the generators, it is also possible to apply updates to modify the generated data afterwards.
When to stop improving is not obvious: On the one hand, real data may show particular distributions that impact on application behavior and performance, thus it may be important to reflect that in the test data. On the other hand, if nothing is really done about readers, then maybe the only relevant information is the average length of firstnames and lastnames because of the storage implications, and that’s it…
There are many more directives to drive data generation, see the online or embedded
datafiller.py --man documentation, including the
comics didactic example with
See the previous post for another example based on