Filter rows
***********


By regex
========

You can use Python's builtin "re" module to introduce a regular
expression into a "Table.where()" query.

For example, here we find all states that start with "C".

   import re

   new_table = table.where(lambda row: re.match('^C', str(row['state'])))

This can also be useful for finding values that **don't** match your
expectations. For example, finding all values in the "phone number"
column that don't look like phone numbers:

   new_table = table.where(lambda row: not re.match('\d{3}-\d{3}-\d{4}', str(row['phone'])))


By glob
=======

Hate regexes? You can use glob ("fnmatch") syntax too!

   from fnmatch import fnmatch

   new_table = table.where(lambda row: fnmatch('C*', row['state']))


Values within a range
=====================

This snippet filters the dataset to incomes between 100,000 and
200,000.

   new_table = table.where(lambda row: 100000 < row['income'] < 200000)


Dates within a range
====================

This snippet filters the dataset to events during the summer of 2015:

   import datetime

   new_table = table.where(lambda row: datetime.datetime(2015, 6, 1) <= row['date'] <= datetime.datetime(2015, 8, 31))

If you want to filter to events during the summer of any year:

   new_table = table.where(lambda row: 6 <= row['date'].month <= 8)


Top N percent
=============

To filter a dataset to the top 10% percent of values we first compute
the percentiles for the column and then use the result in the
"Table.where()" truth test:

   percentiles = table.aggregate(agate.Percentiles('salary'))
   top_ten_percent = table.where(lambda r: r['salary'] >= percentiles[90])


Random sample
=============

By combining a random sort with limiting, we can effectively get a
random sample from a table.

   import random

   randomized = table.order_by(lambda row: random.random())
   sampled = table.limit(10)


Ordered sample
==============

With can also get an ordered sample by simply using the "step"
parameter of the "Table.limit()" method to get every Nth row.

   sampled = table.limit(step=10)


Distinct values
===============

You can retrieve a distinct list of values in a column using
"Column.values_distinct()" or "Table.distinct()".

"Table.distinct()" returns the entire row so it's necessary to chain a
select on the specific column.

   columns = ('value',)
   rows = ([1],[2],[2],[5])
   new_table = agate.Table(rows, columns)

   new_table.columns['value'].values_distinct()
   # or
   new_table.distinct('value').columns['value'].values()
   (Decimal('1'), Decimal('2'), Decimal('5'))
