Statistics
**********

Common descriptive and aggregate statistics are included with the core
agate library. For additional statistical methods beyond the scope of
agate consider using the agate-stats extension or integrating with
scipy.


Descriptive statistics
======================

agate includes a full set of standard descriptive statistics that can
be applied to any column containing "Number" data.

   table.aggregate(agate.Sum('salary'))
   table.aggregate(agate.Min('salary'))
   table.aggregate(agate.Max('salary'))
   table.aggregate(agate.Mean('salary'))
   table.aggregate(agate.Median('salary'))
   table.aggregate(agate.Mode('salary'))
   table.aggregate(agate.Variance('salary'))
   table.aggregate(agate.StDev('salary'))
   table.aggregate(agate.MAD('salary'))

Or, get several at once:

   table.aggregate([
       ('salary_min', agate.Min('salary')),
       ('salary_ave', agate.Mean('salary')),
       ('salary_max', agate.Max('salary')),
   ])


Aggregate statistics
====================

You can also generate aggregate statistics for subsets of data
(sometimes referred to as "rolling up"):

   doctors = patients.group_by('doctor')
   patient_ages = doctors.aggregate([
       ('patient_count', agate.Count()),
       ('age_mean', agate.Mean('age')),
       ('age_median', agate.Median('age'))
   ])

The resulting table will have four columns: "doctor", "patient_count",
"age_mean" and "age_median".

You can roll up by multiple columns by chaining agate's
"Table.group_by()" method.

   doctors_by_state = patients.group_by("state").group_by('doctor')


Distribution by count (frequency)
=================================

Counting the number of each unique value in a column can be
accomplished with the "Table.pivot()" method:

   # Counts of a single column's values
   table.pivot('doctor')

   # Counts of all combinations of more than one column's values
   table.pivot(['doctor', 'hospital'])

The resulting tables will have a column for each key column and
another "Count" column counting the number of instances of each value.


Distribution by percent
=======================

"Table.pivot()" can also be used to calculate the distribution of
values as a percentage of the total number:

   # Percents of a single column's values
   table.pivot('doctor', computation=agate.Percent('Count'))

   # Percents of all combinations of more than one column's values
   table.pivot(['doctor', 'hospital'], computation=agate.Percent('Count'))

The output table will be the same format as the previous example,
except the value column will be named "Percent".


Identify outliers
=================

The agate-stats extension adds methods for finding outliers.

   import agatestats

   agatestats.patch()

   outliers = table.stdev_outliers('salary', deviations=3, reject=False)

By specifying "reject=True" you can instead return a table including
only those values **not** identified as outliers.

   not_outliers = table.stdev_outliers('salary', deviations=3, reject=True)

The second, more robust, method for identifying outliers is by
identifying values which are more than some number of "median absolute
deviations" from the median (typically 3).

   outliers = table.mad_outliers('salary', deviations=3, reject=False)

As with the first example, you can specify "reject=True" to exclude
outliers in the resulting table.


Custom statistics
=================

You can also generate custom aggregated statistics for your data by
defining your own 'summary' aggregation. This might be especially
useful for performing calculations unique to your data. Here's a
simple example:

   # Create a custom summary aggregation with agate.Summary
   # Input a column name, a return data type and a function to apply on the column
   count_millionaires = agate.Summary('salary', agate.Number(), lambda r: sum(salary > 1000000 for salary in r.values()))

   table.aggregate([
       count_millionaires
   ])

Your custom aggregation can be used to determine both descriptive and
aggregate statistics shown above.
