Header Ads

Random sampling in postgres using TABLESAMPLE

 When analyzing data or running analytics sometimes we need to get a random sample of the dataset. There are multiple ways of doing this in a database. One naive approach is to get the all rows in the table and generate random numbers to select random  rowids. Many databases including SQL server and Postgresql have a feature called TABLESAMPLE which does the random sampling for you without need to fetch all rows and select sample randomly.

for example if we want to select random 1% sample from employee table we need to run following query.


select * from employee TABLESAMPLE  SYSTEM(1)  REPEATABLE (123) ;


there are 2 methods of sampling SYSTEM and BERNOULLI.

The SYSTEM method reads random blocks and returns rows stored on those blocks. BERNOULLI method scans the table and selects rows randomly.

REPEATABLE  clause is used to reproduce random sampling. If the table is not changed same seed will ensure the same set of rows in the sample.

This feature can be used in ETL pipelines to implement shuffle and sample features without reading all the rows from a database.




No comments

Powered by Blogger.