Random sampling in postgres using TABLESAMPLE
When analyzing data or running analytics sometimes we need to get a random sample of the dataset. There are multiple ways of doing this in a database. One naive approach is to get the all rows in the table and generate random numbers to select random rowids. Many databases including SQL server and Postgresql have a feature called TABLESAMPLE which does the random sampling for you without need to fetch all rows and select sample randomly.
for example if we want to select random 1% sample from employee table we need to run following query.
select * from employee TABLESAMPLE SYSTEM(1) REPEATABLE (123) ;
there are 2 methods of sampling SYSTEM and BERNOULLI.
The SYSTEM method reads random blocks and returns rows stored on those blocks. BERNOULLI method scans the table and selects rows randomly.
REPEATABLE clause is used to reproduce random sampling. If the table is not changed same seed will ensure the same set of rows in the sample.
This feature can be used in ETL pipelines to implement shuffle and sample features without reading all the rows from a database.
Post a Comment