DataFiller Tutorial

This tutorial introduces how to use DataFiller to fill a PostgreSQL database, say for testing functionalities and performances.

Directives in comments

The starting point of the script to generate test data is the SQL schema of the database taken from a file. It includes important information that will be used to generate data: attribute types, uniqueness, not-null-ness, foreign keys… The idea is to augment the schema with directives in comments so as to provide additional hints about data generation.

A datafiller directive is a special SQL comment recognized by the script, with a df marker at the beginning. A directive must appear after the object about which it is applied, either directly after the object declaration, in which case the object is implicit, or much later, in which case the object must be explicitely referenced:

-- this directive sets the default overall size
  -- df: size=10
-- this directive defines a macro named "fn"
  -- df fn: words=/path/to/file-containing-words
-- this directive applies to table `Foo`
CREATE TABLE Foo( -- df: mult=10.0
  -- this directive applies to attribute `fid`
  fid SERIAL -- df: offset=1000
  -- use defined macro: stuff taken out from the list of words
, stuff TEXT NOT NULL -- df: use=fn
);
-- ... much later
-- this directive applies to attribute `fid` in table `Foo`
-- df T=Foo A=fid: mangle

A simple example

Let us start with a simple example involving a library where readers borrow books. Our schema is defined in file library.sql as follows:

CREATE TABLE Book(
  bid SERIAL PRIMARY KEY,
  title TEXT NOT NULL
);

CREATE TABLE Reader(
  rid SERIAL PRIMARY KEY,
  firstname TEXT NOT NULL,
  lastname TEXT NOT NULL,
  born DATE NOT NULL,
  gender BOOLEAN NOT NULL,
  phone TEXT -- nullable, maybe no phone
);

CREATE TABLE Borrow(
  borrowed TIMESTAMP NOT NULL,
  rid INTEGER NOT NULL REFERENCES Reader,
  bid INTEGER NOT NULL REFERENCES Book,
  PRIMARY KEY(bid) -- a book is borrowed at most once at a time!
);

The first and only information we really need to provide is the relative or absolute size of relations. For scaling easily, the best way is to specify a relative size multiplier with the mult directive on each table, which will be multiplied by the size option to compute the actual size of data to generate in each table. Let us say we want 100 books in stock per reader, with 1.5 borrowed books per reader on average:

CREATE TABLE Book( -- df: mult=100.0
...
CREATE TABLE Borrow( --df: mult=1.5

The default multiplier is 1.0, it does not need to be set on Reader. Then we can generate a data set with:

datafiller.py --size=1000 library.sql > library_test_data.sql

Note that explicit constraints are enforced on the generated data, so that foreign keys in Borrow reference existing Books and Readers. However, the script cannot guess implicit constraints, thus if an attribute is not declared NOT NULL, then some NULL values will be generated. If an attribute is not unique, then the generated values will probably not be unique.

Improving generated values

If we look at the above generated data, some attributes may not reflect the reality one would expect from our library. Changing the default with per-attribute directives will help improve this first result.

First, book titles are all quite short, looking like title_number, including some collisions. Indeed the default is to generate strings with a prefix based on the attribute name and a number drawn uniformly from the expected size of the relation. We can change this to texts composed of 1 to 7 English words taken from a dictionary:

  title TEXT NOT NULL
  -- df: text word=/etc/dictionaries-common/words length=4 lenvar=3

If we now look at the generated readers, the result can also be improved. First, we can decide to keep the prefix and number form, but make the statistics more in line with what one expects. Let us draw from 1000 firstnames, most frequent 3%, and 10000 lastnames, most frequent 1%:

  firstname TEXT NOT NULL,
     -- df: prefix=fn size=1000 gen=power rate=0.03
  lastname TEXT NOT NULL,
     -- df: prefix=ln size=10000 gen=power rate=0.01

The default generated dates are a few days around now, which does not make much sense for our readers’ birth dates. Let us set a range for these dates:

  birth DATE NOT NULL, -- df: start=1923-01-01 end=2010-01-01

Most readers from our library are female: we can adjust the rate so that 25% of readers are male, instead of the default 50%.

  gender BOOLEAN NOT NULL, -- df: rate=0.25

Phone numbers also have a prefix_number structure, which does not really look like a phone number. Let us draw a string of 10 digits, and adjust the nullable rate so that 1% of phone numbers are not known. We also set the size manually to avoid too many collisions, but we could have chosen to keep them as is, as some readers do share phone numbers.

  phone TEXT
    -- these directives could be on a single line
    -- df: chars='0123456789' length=10 lenvar=0
    -- df: null=0.01 size=1000000

The last table is about borrowed books. The timestamps are around now, we are going to spread them on a period of 50 days, that is 24 * 60 * 50 = 72000 minutes, with 60 seconds precision.

  borrowed TIMESTAMP NOT NULL -- df: size=72000 prec=60

Because of the unique constraint, the borrowed books are by default the first ones. Let us mangle the result so that the book numbers are scattered.

  bid INTEGER REFERENCES Book -- df: mangle

Now we can generate improved data for our one thousand readers library, and fill it directly to the library database:

datafiller.py --size=1000 --filter | psql library

Our test database is ready. If we want more users and books, we only need to adjust the size option. Let us query on our test data:

-- show firstname distribution
SELECT firtname, COUNT(*) AS cnt FROM Reader
GROUP BY firstname ORDER BY cnt DESC LIMIT 3;
  -- fn_1_... | 33
  -- fn_2_... | 15
  -- fn_3_... | 12
-- compute gender rate
SELECT AVG(CASE WHEN gender THEN 1.0 ELSE 0.0 END) FROM Reader;
  -- 0.246000

Conclusion

We could go on improving the generated data so that it is more realistic. For instance, we could skew the borrowed timestamp so that there are less old borrowings, or skew the book number so that old books (lower numbers) are less often borrowed, or choose firtnames and lastnames from actual lists. For very application-specific constraints that would not fit the generators, it is also possible to apply updates to modify the generated data afterwards.

When to stop improving is not obvious: On the one hand, real data may show particular distributions that impact on application behavior and performance, thus it may be important to reflect that in the test data. On the other hand, if nothing is really done about readers, then maybe the only relevant information is the average length of firstnames and lastnames because of the storage implications, and that’s it…

There are many more directives to drive data generation, see the online or embedded datafiller.py --man documentation, including the comics didactic example with datafiller.py --test=comics.

The best way to take the feel of DataFiller is to play with it: download the library.sql file to test it!

See the previous post for another example based on pgbench.

rule