Aggregate tools

Functions

`argmax_by_idx`(idx, values[, minlength, fill])	Given array of indexes `idx` and array `values`, outputs the argmax of the values by idx, aligned on `arange(idx.max() + 1)`.
`argmin_by_idx`(idx, values[, minlength, fill])	Given array of indexes `idx` and array `values`, outputs the argmin of the values by idx, aligned on `arange(idx.max() + 1)`.
`average_by_idx`(idx, values[, weights, ...])	Compute average-by-idx given array of indexes `idx`, `values`, and optional `weights`.
`connect_adjacents_in_groups`(group_ids, ...)	For each group_id in `group_ids`, connect values that are closer than `max_gap` together.
`get_most_common_by_idx`(idx, values, fill[, ...])	Given array of indexes `idx` and array `values`, outputs the most common value by idx.
`get_value_by_idx`(idx, values, default[, ...])	Given array of indexes `idx` and array `values` (unordered, not necessarily full), output array such that `out[i] = values[idx==i]`.
`igroupby`(ids, values[, n, logging_prefix, ...])	Efficiently converts two arrays representing a relation (the `ids` and the associated `values`) to an iterable `(id, values_associated)`.
`max_by_idx`(idx, values[, minlength, fill])	Given array of indexes `idx` and array `values`, outputs the max value by idx, aligned on `arange(idx.max() + 1)`.
`min_by_idx`(idx, values[, minlength, fill])	Given array of indexes `idx` and array `values`, outputs the min value by idx, aligned on `arange(idx.max() + 1)`.
`ufunc_group_by_idx`(idx, values, ufunc, init)	Abstract wrapper to compute ufunc grouped by values in array `idx`.
`value_at_argmax_by_idx`(idx, sorting_values, fill)	Wrapper around `argmax_by_idx` and `get_value_by_idx`.
`value_at_argmin_by_idx`(idx, sorting_values, fill)	Wrapper around `argmin_by_idx` and `get_value_by_idx`.

igroupby(ids, values, n=None, logging_prefix=None, assume_sorted=False, find_next_hint=512)

Efficiently converts two arrays representing a relation (the ids and the associated values) to an iterable (id, values_associated).

The values are grouped by ids and a sequence of tuples is generated.

The i-th tuple generated is (id_i, values[ids == id_i]), id_i being the i-th element of the ids array, once sorted in ascending order.

Parameters

ids (array) – (>=n,) dtype array
values (array) – (>=n, *shape) uint32 array
n (int?) – length of array to consider, applying igroupby to (ids[:n], values[:n]). Uses full array when not set.
logging_prefix (string?) – prefix to include while logging progress. (default: does not log)
assume_sorted (bool?) – whether ids is sorted. (default: False)
find_next_hint (int?) – hint for find_next_lookup. (default: 512)

Generates

tuple(id:int, values_associated:(m, *shape) array slice)

Example

>>> ids      = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> gen = igroupby(ids, values)
>>> next(gen)
(0, array([0, 1, 2, 3, 4]))

>>> next(gen)
(1, array([0, 2, 4, 6]))

>>> next(gen)
(3, array([0, 4, 6]))

Example with strings as ids:

>>> ids = numpy.array(["alpha", "alpha", "beta", "omega", "alpha", "gamma", "beta"])
>>> values = numpy.array([1, 2, 10, 100, 3, 1000, 20])
>>> gen = igroupby(ids, values)
>>> next(gen)
('alpha', array([1, 2, 3]))
>>> next(gen)
('beta', array([10, 20]))
>>> next(gen)
('gamma', array([1000]))
>>> next(gen)
('omega', array([100]))
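
The grouping semantics can be illustrated in plain Python. This is only a sketch of what is generated, not the library's implementation: `igroupby_sketch` is a hypothetical name, it works on lists, and it returns lists rather than array slices.

```python
from itertools import groupby
from operator import itemgetter


def igroupby_sketch(ids, values):
    """Yield (id, values_associated) pairs with ids in ascending order.

    Hypothetical pure-Python stand-in, for illustration only.
    """
    # sorted() is stable, so values keep their original order inside each group.
    pairs = sorted(zip(ids, values), key=itemgetter(0))
    for group_id, group in groupby(pairs, key=itemgetter(0)):
        yield group_id, [value for _, value in group]


ids = ["alpha", "alpha", "beta", "omega", "alpha", "gamma", "beta"]
values = [1, 2, 10, 100, 3, 1000, 20]
groups = list(igroupby_sketch(ids, values))
print(groups)
```

The sketch sorts the whole relation up front; the `assume_sorted` parameter of `igroupby` exists so that this step can be skipped when ids is already sorted.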

ufunc_group_by_idx(idx, values, ufunc, init, minlength=None)

Abstract wrapper to compute ufunc grouped by values in array idx.

Return an array containing the results of ufunc applied to values grouped by the indexes in array idx. (See the NumPy documentation for the available ufuncs.)

Warning: the init parameter is not a filling value for missing indexes. If index i is missing, then out[i] = init but this value also serves as the initialization of ufunc on all the groups of values.

For example, if ufunc is numpy.add and init = -1 then for each index, the sum of the corresponding values will be decreased by one.

Parameters

idx (array) – (n,) int array
values (array) – (n,) dtype array
ufunc (numpy.ufunc) – universal function applied to the groups of values
init (dtype) – initialization value
minlength (int?) – (default: idx.max() + 1)

Returns

(min-length,) dtype array, such that out[i] = ufunc(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> ufunc_group_by_idx(idx, values, numpy.maximum, -1)
array([ 4, 6,  -1,  6])
>>> ufunc_group_by_idx(idx, values, numpy.add, -1)
array([ 9, 11, -1,  9])
>>> ufunc_group_by_idx(idx, values, numpy.add, 0)
array([10, 12,  0, 10])
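
The role of init can be sketched in plain Python: every output slot starts at init, and each value is then folded into its slot. `ufunc_group_by_idx_sketch` is a hypothetical name, binary Python callables stand in for NumPy ufuncs, and the handling of `minlength` (never shrinking below `max(idx) + 1`) is an assumption.

```python
import operator


def ufunc_group_by_idx_sketch(idx, values, func, init, minlength=None):
    """Fold `func` over the values of each index, starting every slot at `init`.

    Hypothetical pure-Python illustration, not the library's implementation.
    """
    size = max(idx) + 1
    if minlength is not None:
        size = max(size, minlength)  # assumption: minlength only ever pads
    out = [init] * size  # every slot starts at init, whether the index occurs or not
    for i, value in zip(idx, values):
        out[i] = func(out[i], value)
    return out


idx = [0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3]
values = [0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6]
maxima = ufunc_group_by_idx_sketch(idx, values, max, -1)
sums_from_minus_one = ufunc_group_by_idx_sketch(idx, values, operator.add, -1)
sums_from_zero = ufunc_group_by_idx_sketch(idx, values, operator.add, 0)
print(maxima, sums_from_minus_one, sums_from_zero)
```

With `operator.add`, each sum started from -1 comes out one lower than the same sum started from 0, which is the effect the warning describes.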

min_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the min value by idx, aligned on arange(idx.max() + 1). See also argmin_by_idx and value_at_argmin_by_idx.

Parameters

idx (array) – (n,) int array
values (array) – (n,) float array
minlength (int?) – (default: idx.max() + 1)
fill (float?) – filling value for missing idx (default: +inf)

Returns

(min-length,) float array, such that out[i] = min(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([1, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> min_by_idx(idx, values, fill=100)
array([  1,   0, 100,   0])
>>> min_by_idx(idx, values)
array([1, 0, 9223372036854775807, 0])

max_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the max value by idx, aligned on arange(idx.max() + 1). See also argmax_by_idx and value_at_argmax_by_idx.

Parameters

idx (array) – (n,) int array
values (array) – (n,) float array
minlength (int?) – (default: idx.max() + 1)
fill (float?) – filling value for missing idx (default: -inf)

Returns

(min-length,) float array, such that out[i] = max(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> max_by_idx(idx, values, fill=-1)
array([ 4, 6,  -1,  6])
>>> max_by_idx(idx, values, minlength=10, fill=-1)
array([ 4,  6, -1,  6, -1, -1, -1, -1, -1, -1])
>>> max_by_idx(idx, values)
array([ 4, 6, -9223372036854775808, 6])

argmin_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the argmin of the values by idx, aligned on arange(idx.max() + 1). See also min_by_idx and value_at_argmin_by_idx.

Parameters

idx (array) – (n,) int array
values (array) – (n,) float array
minlength (int?) – (default: idx.max() + 1)
fill (float?) – filling value for missing idx (default: -1)

Returns

(min-length,) int32 array, such that out[i] = argmin_{idx}(values[idx] : idx[idx] == i)

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> argmin_by_idx(idx, values, fill=-1)
array([ 0,  5, -1,  9])
>>> argmin_by_idx(idx, values, minlength=10, fill=-1)
array([ 0,  5, -1,  9, -1, -1, -1, -1, -1, -1])
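
The returned positions refer to the original `values` array, not to positions inside each group. A minimal plain-Python sketch, under the assumption that ties keep the first position (the source does not state the tie-breaking rule); `argmin_by_idx_sketch` is a hypothetical name.

```python
def argmin_by_idx_sketch(idx, values, fill=-1, minlength=None):
    """Position in `values` of the smallest value for each index.

    Hypothetical pure-Python illustration, not the library's implementation.
    """
    size = max(idx) + 1
    if minlength is not None:
        size = max(size, minlength)
    best = [None] * size  # None marks an index that has not been seen yet
    for position, (i, value) in enumerate(zip(idx, values)):
        # Strict comparison keeps the first position on ties (a choice made here).
        if best[i] is None or value < values[best[i]]:
            best[i] = position
    return [fill if position is None else position for position in best]


idx = [0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3]
values = [0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6]
positions = argmin_by_idx_sketch(idx, values, fill=-1)
padded = argmin_by_idx_sketch(idx, values, minlength=10, fill=-1)
print(positions, padded)
```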

value_at_argmin_by_idx(idx, sorting_values, fill, output_values=None, minlength=None)

Wrapper around argmin_by_idx and get_value_by_idx. It allows using one array to detect the minimum and another for the output, and setting a specific fill value that is not compared with the sorting_values.

Parameters

idx (array) – (n,) uint array with values < max_idx
sorting_values (array) – (n,) array
fill – filling value for output[i] if there is no idx == i
output_values (array?) – (n,) dtype array. Useful if you want to select the min based on one array and get the value from another array.
minlength (int?) – minimum shape for the output array.

Returns array

(max_idx+1,), dtype array such that out[i] = min(sorting_values[idx==i]), or the matching entry of output_values when it is provided

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> value_at_argmin_by_idx(idx, values, fill=-1)
array([ 0,  0, -1,  0])
>>> value_at_argmin_by_idx(idx, values, minlength=10, fill=-1)
array([ 0,  0, -1,  0, -1, -1, -1, -1, -1, -1])
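
The examples above do not exercise `output_values`. The following plain-Python sketch illustrates the intended use: pick the position of the minimum in one array and read the result from another. `value_at_argmin_by_idx_sketch`, `distances` and `labels` are hypothetical names, and the tie-breaking (first position wins) is an assumption.

```python
def value_at_argmin_by_idx_sketch(idx, sorting_values, fill, output_values=None):
    """For each index, return output_values at the position where sorting_values is smallest.

    Hypothetical pure-Python illustration, not the library's implementation.
    """
    if output_values is None:
        output_values = sorting_values
    best = {}  # index -> position of the smallest sorting value seen so far
    for position, i in enumerate(idx):
        if i not in best or sorting_values[position] < sorting_values[best[i]]:
            best[i] = position
    # fill is written only for missing indexes; it never takes part in the comparison.
    return [output_values[best[i]] if i in best else fill for i in range(max(idx) + 1)]


idx = [0, 0, 1, 1, 3]
distances = [2.5, 0.5, 4.0, 3.0, 1.0]  # used to detect the minimum
labels = ["a", "b", "c", "d", "e"]     # values actually returned
picked = value_at_argmin_by_idx_sketch(idx, distances, fill="none", output_values=labels)
minima = value_at_argmin_by_idx_sketch(idx, distances, fill=-1.0)
print(picked, minima)
```

In the second call the fill value -1.0 is lower than every distance, yet it only appears at the missing index 2.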

argmax_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the argmax of the values by idx, aligned on arange(idx.max() + 1). See also max_by_idx and value_at_argmax_by_idx.

Parameters

idx (array) – (n,) int array
values (array) – (n,) float array
minlength (int?) – (default: idx.max() + 1)
fill (float?) – filling value for missing idx (default: -1)

Returns

(min-length,) int32 array, such that out[i] = argmax_{idx}(values[idx] : idx[idx] == i)

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> argmax_by_idx(idx, values, fill=-1)
array([ 4,  8, -1, 11])
>>> argmax_by_idx(idx, values, minlength=10, fill=-1)
array([ 4,  8, -1, 11, -1, -1, -1, -1, -1, -1])

value_at_argmax_by_idx(idx, sorting_values, fill, output_values=None, minlength=None)

Wrapper around argmax_by_idx and get_value_by_idx. It allows using one array to detect the maximum and another for the output, and setting a specific fill value that is not compared with the sorting_values.

Parameters

idx (array) – (n,) uint array with values < max_idx
sorting_values (array) – (n,) array
fill – filling value for output[i] if there is no idx == i
output_values (array?) – (n,) dtype array. Useful if you want to select the max based on one array and get the value from another array.
minlength (int?) – minimum shape for the output array.

Returns array

(max_idx+1,), dtype array such that out[i] = max(sorting_values[idx==i]), or the matching entry of output_values when it is provided

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> value_at_argmax_by_idx(idx, values, fill=-1)
array([ 4,  6, -1,  6])
>>> value_at_argmax_by_idx(idx, values, minlength=10, fill=-1)
array([ 4,  6, -1,  6, -1, -1, -1, -1, -1, -1])

connect_adjacents_in_groups(group_ids, values, max_gap)

For each group_id in group_ids, connect values that are closer than max_gap together.

Return an array mapping the values to the indexes of the newly formed connected components they belong to.

Two values that don't have the same input group_id cannot be connected in the same connected component.

connect_adjacents_in_groups is faster when an array of indexes is provided as group_ids, but also accepts other types of ids.

Parameters

group_ids (array) – (n,) dtype array
values (array) – (n,) float array
max_gap (float) – maximum distance between a value and the nearest value in the same group.

Returns

(n,) uint array, such that out[s[i]] == out[s[i+1]] if and only if group_ids[s[i]] == group_ids[s[i+1]] and |values[s[i]] - values[s[i+1]]| <= max_gap, where s[i] is the i-th index when sorting by id and value

Example

>>> group_ids = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3])
>>> values = numpy.array([ 0, 35, 20, 25, 30,  0,  5, 10, 20,  0,  5, 10, 15])
>>> connect_adjacents_in_groups(group_ids, values, max_gap = 5)
array([0, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4], dtype=uint32)

Example with string group_ids:

>>> group_ids = numpy.array(['alpha', 'alpha', 'alpha', 'alpha', 'alpha', 'beta', 'beta', 'beta', 'beta', 'gamma', 'gamma', 'gamma', 'gamma'])
>>> values = numpy.array([ 0, 35, 20, 25, 30,  0,  5, 10, 20,  0,  5, 10, 15])
>>> connected_components_ids = connect_adjacents_in_groups(group_ids, values, max_gap = 5)

The function does not require the group_ids or the values to be sorted:

>>> shuffler = numpy.random.permutation(len(group_ids))
>>> group_ids_shuffled = group_ids[shuffler]
>>> values_shuffled = values[shuffler]
>>> connect_adjacents_in_groups(group_ids_shuffled, values_shuffled, max_gap = 5)
array([2, 1, 0, 2, 4, 1, 1, 4, 1, 4, 3, 2, 4], dtype=uint32)
>>> connected_components_ids[shuffler]
array([2, 1, 0, 2, 4, 1, 1, 4, 1, 4, 3, 2, 4], dtype=uint32)
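
The rule in the Returns description can be sketched in plain Python: visit positions sorted by (group id, value) and open a new component whenever the group changes or the gap to the previous value exceeds `max_gap`. `connect_adjacents_sketch` is a hypothetical name; numbering components in sorted order is an assumption that happens to match the first example.

```python
def connect_adjacents_sketch(group_ids, values, max_gap):
    """Label connected components of values inside each group.

    Hypothetical pure-Python illustration, not the library's implementation.
    """
    # Positions sorted by group id, then by value within the group.
    order = sorted(range(len(values)), key=lambda p: (group_ids[p], values[p]))
    out = [0] * len(values)
    component = -1
    previous = None
    for position in order:
        same_group = previous is not None and group_ids[previous] == group_ids[position]
        close = same_group and values[position] - values[previous] <= max_gap
        if not close:
            component += 1  # group changed or gap too large: start a new component
        out[position] = component
        previous = position
    return out


group_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3]
values = [0, 35, 20, 25, 30, 0, 5, 10, 20, 0, 5, 10, 15]
components = connect_adjacents_sketch(group_ids, values, max_gap=5)
print(components)
```

Because the result is written back at each original position, the input does not need to be sorted, as in the shuffled example above.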

get_value_by_idx(idx, values, default, check_unique=True, minlength=None)

Given array of indexes idx and array values (unordered, not necessarily full), output array such that out[i] = values[idx==i].

If all indexes in idx are unique, it is equivalent to sorting the values by their idx and filling with default for missing idx.

If idx elements are not unique and you still want to proceed, you can set check_unique to False. The output values for the non-unique indexes will be chosen arbitrarily among the multiple values corresponding.

Parameters

idx (array) – (n,) uint array with values < max_idx
values (array) – (n,) dtype array
default (dtype) – filling value for output[i] if there is no idx == i
check_unique (bool) – if True, checks that the idx are unique. If False and the idx are not unique, an arbitrary value is chosen.
minlength (int?) – minimum shape for the output array (default: idx.max() + 1).

Returns array

(max_idx+1,), dtype array such that out[i] = values[idx==i].

Example

>>> idx = numpy.array([8,2,4,7])
>>> values = numpy.array([100, 200, 300, 400])
>>> get_value_by_idx(idx, values, -1, check_unique=False, minlength=None)
array([ -1,  -1, 200,  -1, 300,  -1,  -1, 400, 100])

Example with non-unique elements in idx:

>>> idx = numpy.array([2,2,4,7])
>>> values = numpy.array([100, 200, 300, 400])
>>> get_value_by_idx(idx, values, -1, check_unique=False, minlength=None)
array([ -1,  -1, 200,  -1, 300,  -1,  -1, 400])
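
A plain-Python sketch of the scatter described above. `get_value_by_idx_sketch` is a hypothetical name, and two of its behaviours are assumptions: duplicates raise `ValueError` when `check_unique` is True, and the last value written wins when it is False (the source only says the choice is arbitrary).

```python
def get_value_by_idx_sketch(idx, values, default, check_unique=True, minlength=None):
    """Place each value at its index and put `default` everywhere else.

    Hypothetical pure-Python illustration, not the library's implementation.
    """
    if check_unique and len(set(idx)) != len(idx):
        # Assumption: how the library reports duplicates is not documented here.
        raise ValueError("idx contains duplicates")
    size = max(idx) + 1
    if minlength is not None:
        size = max(size, minlength)
    out = [default] * size
    for i, value in zip(idx, values):
        out[i] = value  # with duplicates, the last write wins in this sketch
    return out


unique_result = get_value_by_idx_sketch([8, 2, 4, 7], [100, 200, 300, 400], -1)
duplicate_result = get_value_by_idx_sketch(
    [2, 2, 4, 7], [100, 200, 300, 400], -1, check_unique=False
)
print(unique_result, duplicate_result)
```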

get_most_common_by_idx(idx, values, fill, minlength=None)

Given array of indexes idx and array values, outputs the most common value by idx.

Parameters

idx (array) – (n,) uint array with values < max_idx
values (array) – (n,) non-float, dtype array
fill – filling value for output[i] if there is no idx == i
minlength (int?) – minimum shape for the output array.

Returns

(max_idx+1,), dtype array such that out[i] = the most common value in values[idx==i]
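
Example

A plain-Python sketch of the expected behaviour, using `collections.Counter`. `most_common_by_idx_sketch` is a hypothetical name, and the source does not say how ties are resolved, so the data below avoids them.

```python
from collections import Counter, defaultdict


def most_common_by_idx_sketch(idx, values, fill, minlength=None):
    """Most frequent value for each index, `fill` where an index never occurs.

    Hypothetical pure-Python illustration, not the library's implementation.
    """
    counts = defaultdict(Counter)  # index -> occurrences of each value
    for i, value in zip(idx, values):
        counts[i][value] += 1
    size = max(idx) + 1
    if minlength is not None:
        size = max(size, minlength)
    # Counter.most_common(1) returns [(value, count)] for the most frequent value.
    return [counts[i].most_common(1)[0][0] if i in counts else fill for i in range(size)]


idx = [0, 0, 0, 1, 1, 3]
values = [7, 7, 9, 4, 4, 5]
result = most_common_by_idx_sketch(idx, values, fill=-1)
print(result)
```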

average_by_idx(idx, values, weights=None, minlength=None, fill=0, dtype='float64')

Compute average-by-idx given array of indexes idx, values, and optional weights.

Parameters

idx (array) – (n,) int array
values (array) – (n,) float array
weights (array?) – (n,) float array
minlength (int?) – (default: idx.max() + 1)
fill (float?) – filling value for missing idx (default: 0)
dtype (str?) – (default: 'float64')

Returns

(min-length,) float array, such that out[i] = mean(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> average_by_idx(idx, values, fill=0)
array([ 2.        ,  3.        , 0.        ,  3.33333333])
>>> weights = numpy.array([0, 1, 0, 0, 0, 1, 2, 3, 4, 1, 1, 0])
>>> average_by_idx(idx, values, weights=weights, fill=0)
array([ 1.,  4., 0.,  2.])
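
The weighted result above is consistent with out[i] = sum(weights * values) / sum(weights) over the entries where idx == i. A plain-Python sketch under that reading; `average_by_idx_sketch` is a hypothetical name, and returning `fill` for an index whose weights sum to zero is an assumption.

```python
def average_by_idx_sketch(idx, values, weights=None, fill=0.0):
    """Weighted mean of the values of each index, `fill` where no weight is present.

    Hypothetical pure-Python illustration, not the library's implementation.
    """
    if weights is None:
        weights = [1.0] * len(values)  # plain mean: every value weighs the same
    size = max(idx) + 1
    weighted_sums = [0.0] * size
    weight_totals = [0.0] * size
    for i, value, weight in zip(idx, values, weights):
        weighted_sums[i] += weight * value
        weight_totals[i] += weight
    # A zero total means the index is missing (or all its weights are zero).
    return [s / w if w else fill for s, w in zip(weighted_sums, weight_totals)]


idx = [0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3]
values = [0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6]
weights = [0, 1, 0, 0, 0, 1, 2, 3, 4, 1, 1, 0]
unweighted = average_by_idx_sketch(idx, values)
weighted = average_by_idx_sketch(idx, values, weights=weights)
print(unweighted, weighted)
```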