Aggregate tools

Functions

argmax_by_idx(idx, values[, minlength, fill])

Given array of indexes idx and array values, outputs the argmax of the values by idx, aligned on arange(idx.max() + 1).

argmin_by_idx(idx, values[, minlength, fill])

Given array of indexes idx and array values, outputs the argmin of the values by idx, aligned on arange(idx.max() + 1).

average_by_idx(idx, values[, weights, ...])

Compute average-by-idx given array of indexes idx, values, and optional weights

connect_adjacents_in_groups(group_ids, ...)

For each group_id in group_ids, connect values that are closer than max_gap together.

get_most_common_by_idx(idx, values, fill[, ...])

Given array of indexes idx and array values, outputs the most common value by idx.

get_value_by_idx(idx, values, default[, ...])

Given array of indexes idx and array values (unordered, not necessarily full), output array such that out[i] = values[idx==i].

igroupby(ids, values[, n, logging_prefix, ...])

Efficiently converts two arrays representing a relation (the ids and the associated values) to an iterable (id, values_associated).

max_by_idx(idx, values[, minlength, fill])

Given array of indexes idx and array values, outputs the max value by idx, aligned on arange(idx.max() + 1).

min_by_idx(idx, values[, minlength, fill])

Given array of indexes idx and array values, outputs the min value by idx, aligned on arange(idx.max() + 1).

ufunc_group_by_idx(idx, values, ufunc, init)

Abstract wrapper to compute ufunc grouped by values in array idx.

value_at_argmax_by_idx(idx, sorting_values, fill)

Wrapper around argmax_by_idx and get_value_by_idx.

value_at_argmin_by_idx(idx, sorting_values, fill)

Wrapper around argmin_by_idx and get_value_by_idx.

igroupby(ids, values, n=None, logging_prefix=None, assume_sorted=False, find_next_hint=512)

Efficiently converts two arrays representing a relation (the ids and the associated values) to an iterable (id, values_associated).

The values are grouped by ids and a sequence of tuples is generated.

The i-th tuple generated is (id_i, values[ids == id_i]), id_i being the i-th distinct element of the ids array, once sorted in ascending order.

Parameters
  • ids (array) – (>=n,) dtype array

  • values (array) – (>=n, *shape) uint32 array

  • n (int?) – length of array to consider, applying igroupby to (ids[:n], values[:n]). Uses full array when not set.

  • logging_prefix (string?) – prefix to include while logging progress. (default: does not log)

  • assume_sorted (bool?) – whether ids is sorted. (default: False)

  • find_next_hint (int?) – hint for find_next_lookup. (default: 512)

Generates

tuple(id:int, values_associated:(m, *shape) array slice)

Example

>>> ids      = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> gen = igroupby(ids, values)
>>> next(gen)
(0, array([0, 1, 2, 3, 4]))
>>> next(gen)
(1, array([0, 2, 4, 6]))
>>> next(gen)
(3, array([0, 4, 6]))

Example with strings as ids:

>>> ids = numpy.array(["alpha", "alpha", "beta", "omega", "alpha", "gamma", "beta"])
>>> values = numpy.array([1, 2, 10, 100, 3, 1000, 20])
>>> gen = igroupby(ids, values)
>>> next(gen)
('alpha', array([1, 2, 3]))
>>> next(gen)
('beta', array([10, 20]))
>>> next(gen)
('gamma', array([1000]))
>>> next(gen)
('omega', array([100]))
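
The grouping described above can be reproduced with plain NumPy. The following is a minimal sketch, not the library's implementation: igroupby_sketch is a hypothetical name, and it ignores the n, logging_prefix, assume_sorted and find_next_hint parameters.

```python
import numpy as np

def igroupby_sketch(ids, values):
    """Yield (id, values[ids == id]) for each distinct id, in ascending id order."""
    # A stable sort keeps the original order of the values inside each group.
    order = np.argsort(ids, kind="stable")
    sorted_ids = ids[order]
    sorted_values = values[order]
    # First position of each distinct id in the sorted array.
    unique_ids, starts = np.unique(sorted_ids, return_index=True)
    ends = np.append(starts[1:], len(sorted_ids))
    for group_id, start, end in zip(unique_ids, starts, ends):
        yield group_id, sorted_values[start:end]

ids = np.array(["alpha", "alpha", "beta", "omega", "alpha", "gamma", "beta"])
values = np.array([1, 2, 10, 100, 3, 1000, 20])
groups = [(str(k), v.tolist()) for k, v in igroupby_sketch(ids, values)]
print(groups)
```

The sketch sorts once and then slices, so each group is a contiguous view of the sorted values rather than the result of a separate ids == id_i mask.
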
ufunc_group_by_idx(idx, values, ufunc, init, minlength=None)

Abstract wrapper to compute ufunc grouped by values in array idx.

Return an array containing the results of ufunc applied to values grouped by the indexes in array idx. (See available ufuncs here).

Warning: the init parameter is not a filling value for missing indexes. If index i is missing, then out[i] = init, but init is also the starting value that ufunc is applied to for every group of values.

For example, if ufunc is numpy.add and init = -1 then for each index, the sum of the corresponding values will be decreased by one.

Parameters
  • idx (array) – (n,) int array

  • values (array) – (n,) dtype array

  • ufunc (numpy.ufunc) – universal function applied to the groups of values

  • init (dtype) – initialization value

  • minlength (int?) – (default: idx.max() + 1)

Returns

(min-length,) dtype array, such that out[i] = ufunc(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> ufunc_group_by_idx(idx, values, numpy.maximum, -1)
array([ 4, 6,  -1,  6])
>>> ufunc_group_by_idx(idx, values, numpy.add, -1)
array([ 9, 11, -1,  9])
>>> ufunc_group_by_idx(idx, values, numpy.add, 0)
array([10, 12,  0, 10])
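
The role of init described in the warning above can be illustrated with numpy.ufunc.at. This is a sketch under assumptions, not the library's code: ufunc_group_by_idx_sketch is a hypothetical name, and minlength is assumed to be a lower bound on the output length.

```python
import numpy as np

def ufunc_group_by_idx_sketch(idx, values, ufunc, init, minlength=None):
    """Apply a binary ufunc to values grouped by idx, starting every slot at init."""
    length = max(int(idx.max()) + 1, minlength or 0)
    # Every slot starts at init, including the slots of indexes that are present.
    out = np.full(length, init, dtype=values.dtype)
    # Unbuffered in-place update: out[i] = ufunc(out[i], v) for each (i, v) pair.
    ufunc.at(out, idx, values)
    return out

idx = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
values = np.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
max_result = ufunc_group_by_idx_sketch(idx, values, np.maximum, -1)
sum_minus_one = ufunc_group_by_idx_sketch(idx, values, np.add, -1)
sum_zero = ufunc_group_by_idx_sketch(idx, values, np.add, 0)
print(max_result.tolist())     # [4, 6, -1, 6]
print(sum_minus_one.tolist())  # [9, 11, -1, 9]
print(sum_zero.tolist())       # [10, 12, 0, 10]
```

With numpy.add and init = -1, every sum is shifted by -1, which is why init should normally be the neutral element of the ufunc.
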
min_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the min value by idx, aligned on arange(idx.max() + 1). See also argmin_by_idx and value_at_argmin_by_idx.

Parameters
  • idx (array) – (n,) int array

  • values (array) – (n,) float array

  • minlength (int?) – (default: idx.max() + 1)

  • fill (float?) – filling value for missing idx (default: +inf)

Returns

(min-length,) float array, such that out[i] = min(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([1, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> min_by_idx(idx, values, fill=100)
array([  1,   0, 100,   0])
>>> min_by_idx(idx, values)
array([1, 0, 9223372036854775807, 0])
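
In the last output above, the slot of the missing index holds 9223372036854775807, the largest int64 value, which stands in for +inf on integer arrays. The sketch below shows one way to get that behaviour while keeping fill separate from the reduction. min_by_idx_sketch is a hypothetical name, and the choice of the dtype maximum as starting value is an assumption drawn from the example output.

```python
import numpy as np

def min_by_idx_sketch(idx, values, minlength=None, fill=None):
    """Minimum of values per idx; slots with no value get fill."""
    length = max(int(idx.max()) + 1, minlength or 0)
    # Neutral element of minimum for this dtype (assumption: +inf or dtype max).
    if np.issubdtype(values.dtype, np.floating):
        neutral = np.inf
    else:
        neutral = np.iinfo(values.dtype).max
    out = np.full(length, neutral, dtype=values.dtype)
    np.minimum.at(out, idx, values)
    if fill is not None:
        # Overwrite only the slots that received no value at all.
        missing = np.bincount(idx, minlength=length) == 0
        out[missing] = fill
    return out

idx = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
values = np.array([1, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6], dtype=np.int64)
with_fill = min_by_idx_sketch(idx, values, fill=100)
without_fill = min_by_idx_sketch(idx, values)
print(with_fill.tolist())     # [1, 0, 100, 0]
print(without_fill.tolist())  # [1, 0, 9223372036854775807, 0]
```

Applying fill after the reduction means a fill such as 100 never competes with the actual values, even when some of them are larger than 100.
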
max_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the max value by idx, aligned on arange(idx.max() + 1). See also argmax_by_idx and value_at_argmax_by_idx.

Parameters
  • idx (array) – (n,) int array

  • values (array) – (n,) float array

  • minlength (int?) – (default: idx.max() + 1)

  • fill (float?) – filling value for missing idx (default: -inf)

Returns

(min-length,) float array, such that out[i] = max(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> max_by_idx(idx, values, fill=-1)
array([ 4, 6,  -1,  6])
>>> max_by_idx(idx, values, minlength=10, fill=-1)
array([ 4,  6, -1,  6, -1, -1, -1, -1, -1, -1])
>>> max_by_idx(idx, values)
array([ 4, 6, -9223372036854775808, 6])
argmin_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the argmin of the values by idx, aligned on arange(idx.max() + 1). See also min_by_idx and value_at_argmin_by_idx.

Parameters
  • idx (array) – (n,) int array

  • values (array) – (n,) float array

  • minlength (int?) – (default: idx.max() + 1)

  • fill (float?) – filling value for missing idx (default: -1)

Returns

(min-length,) int32 array, such that out[i] = argmin_{idx}(values[idx] : idx[idx] == i)

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> argmin_by_idx(idx, values, fill=-1)
array([ 0,  5, -1,  9])
>>> argmin_by_idx(idx, values, minlength=10, fill=-1)
array([ 0,  5, -1,  9, -1, -1, -1, -1, -1, -1])
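
A possible way to compute this result with plain NumPy is to sort by idx and then by value, and keep the first position of each group. This is a sketch, not the library's implementation: argmin_by_idx_sketch is a hypothetical name, and returning the first occurrence on ties is an assumption.

```python
import numpy as np

def argmin_by_idx_sketch(idx, values, minlength=None, fill=-1):
    """Position in values of the smallest value for each idx."""
    length = max(int(idx.max()) + 1, minlength or 0)
    # lexsort uses the last key as primary key: sort by idx, then by value.
    order = np.lexsort((values, idx))
    sorted_idx = idx[order]
    # The first element of each run of equal idx is the minimum of that group.
    is_first = np.r_[True, sorted_idx[1:] != sorted_idx[:-1]]
    out = np.full(length, fill, dtype=np.int32)
    out[sorted_idx[is_first]] = order[is_first]
    return out

idx = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
values = np.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
positions = argmin_by_idx_sketch(idx, values)
print(positions.tolist())  # [0, 5, -1, 9]
```
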
value_at_argmin_by_idx(idx, sorting_values, fill, output_values=None, minlength=None)

Wrapper around argmin_by_idx and get_value_by_idx. Allows using one array to detect the minimum and another array for the output values. Also allows setting a specific fill value that is not compared with the sorting_values.

Parameters
  • idx (array) – (n,) uint array with values < max_idx

  • sorting_values (array) – (n,) array

  • fill – filling value for output[i] if there is no idx == i

  • output_values (array?) – (n,) dtype array. Useful if you want to select the min based on one array and get the value from another array.

  • minlength (int?) – minimum shape for the output array.

Returns array

(max_idx+1,), dtype array such that out[i] = min(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> value_at_argmin_by_idx(idx, values, fill=-1)
array([ 0,  0, -1,  0])
>>> value_at_argmin_by_idx(idx, values, minlength=10, fill=-1)
array([ 0,  0, -1,  0, -1, -1, -1, -1, -1, -1])
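
The examples above do not use output_values. The sketch below illustrates that case: the minimum is located in one array (scores) and the result is read from another (labels). The names value_at_argmin_by_idx_sketch, scores and labels are hypothetical, and the code is a plain NumPy approximation rather than the library's implementation.

```python
import numpy as np

def value_at_argmin_by_idx_sketch(idx, sorting_values, fill,
                                  output_values=None, minlength=None):
    """For each idx, return output_values at the position where sorting_values is smallest."""
    if output_values is None:
        output_values = sorting_values
    length = max(int(idx.max()) + 1, minlength or 0)
    # Sort by idx, then by sorting value; the first row of each group wins.
    order = np.lexsort((sorting_values, idx))
    sorted_idx = idx[order]
    is_first = np.r_[True, sorted_idx[1:] != sorted_idx[:-1]]
    # fill is written directly to the output and never compared with sorting_values.
    out = np.full(length, fill, dtype=output_values.dtype)
    out[sorted_idx[is_first]] = output_values[order[is_first]]
    return out

idx = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
scores = np.array([3, 1, 2, 3, 4, 5, 2, 4, 6, 7, 4, 6])
labels = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21])
picked = value_at_argmin_by_idx_sketch(idx, scores, fill=-1, output_values=labels)
print(picked.tolist())  # [11, 16, -1, 20]
```
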
argmax_by_idx(idx, values, minlength=None, fill=None)

Given array of indexes idx and array values, outputs the argmax of the values by idx, aligned on arange(idx.max() + 1). See also max_by_idx and value_at_argmax_by_idx.

Parameters
  • idx (array) – (n,) int array

  • values (array) – (n,) float array

  • minlength (int?) – (default: idx.max() + 1)

  • fill (float?) – filling value for missing idx (default: -1)

Returns

(min-length,) int32 array, such that out[i] = argmax_{idx}(values[idx] : idx[idx] == i)

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> argmax_by_idx(idx, values, fill=-1)
array([ 4,  8, -1, 11])
>>> argmax_by_idx(idx, values, minlength=10, fill=-1)
array([ 4,  8, -1, 11, -1, -1, -1, -1, -1, -1])
value_at_argmax_by_idx(idx, sorting_values, fill, output_values=None, minlength=None)

Wrapper around argmax_by_idx and get_value_by_idx. Allows using one array to detect the maximum and another array for the output values. Also allows setting a specific fill value that is not compared with the sorting_values.

Parameters
  • idx (array) – (n,) uint array with values < max_idx

  • sorting_values (array) – (n,) array

  • fill – filling value for output[i] if there is no idx == i

  • output_values (array?) – (n,) dtype array. Useful if you want to select the max based on one array and get the value from another array.

  • minlength (int?) – minimum shape for the output array.

Returns array

(max_idx+1,), dtype array such that out[i] = max(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> value_at_argmax_by_idx(idx, values, fill=-1)
array([ 4,  6, -1,  6])
>>> value_at_argmax_by_idx(idx, values, minlength=10, fill=-1)
array([ 4,  6, -1,  6, -1, -1, -1, -1, -1, -1])
connect_adjacents_in_groups(group_ids, values, max_gap)

For each group_id in group_ids, connect values that are closer than max_gap together.

Return an array mapping the values to the indexes of the newly formed connected components they belong to.

Two values that don’t have the same input group_id cannot be connected in the same connected component.

connect_adjacents_in_groups is faster when an array of indexes is provided as group_ids, but also accepts other types of ids.

Parameters
  • group_ids (array) – (n,) dtype array

  • values (array) – (n,) float array

  • max_gap (float) – maximum distance between a value and the nearest value in the same group.

Returns

(n,) uint array, such that out[s[i]] == out[s[i+1]] if and only if group_ids[s[i]] == group_ids[s[i+1]] and |values[s[i]] - values[s[i+1]]| <= max_gap, where s[i] is the i-th index when sorting by id and value

Example

>>> group_ids = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3])
>>> values = numpy.array([ 0, 35, 20, 25, 30,  0,  5, 10, 20,  0,  5, 10, 15])
>>> connect_adjacents_in_groups(group_ids, values, max_gap = 5)
array([0, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4], dtype=uint32)

Example with string group_ids:

>>> group_ids = numpy.array(['alpha', 'alpha', 'alpha', 'alpha', 'alpha', 'beta', 'beta', 'beta', 'beta', 'gamma', 'gamma', 'gamma', 'gamma'])
>>> values = numpy.array([ 0, 35, 20, 25, 30,  0,  5, 10, 20,  0,  5, 10, 15])
>>> connected_components_ids = connect_adjacents_in_groups(group_ids, values, max_gap = 5)

The function does not require the group_ids or the values to be sorted:

>>> shuffler = numpy.random.permutation(len(group_ids))
>>> group_ids_shuffled = group_ids[shuffler]
>>> values_shuffled = values[shuffler]
>>> connect_adjacents_in_groups(group_ids_shuffled, values_shuffled, max_gap = 5)
array([2, 1, 0, 2, 4, 1, 1, 4, 1, 4, 3, 2, 4], dtype=uint32)
>>> connected_components_ids[shuffler]
array([2, 1, 0, 2, 4, 1, 1, 4, 1, 4, 3, 2, 4], dtype=uint32)
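
The rule stated in the Returns section can be sketched with a sort followed by a cumulative sum. This is an illustration under assumptions, not the library's code: connect_adjacents_sketch is a hypothetical name, and numbering the components in sorted (group, value) order is inferred from the first example.

```python
import numpy as np

def connect_adjacents_sketch(group_ids, values, max_gap):
    """Label connected components of values inside each group."""
    # Sort by group id, then by value (lexsort uses the last key as primary key).
    order = np.lexsort((values, group_ids))
    g, v = group_ids[order], values[order]
    # A new component starts when the group changes or the gap exceeds max_gap.
    starts_new = np.r_[True, (g[1:] != g[:-1]) | (np.diff(v) > max_gap)]
    component_sorted = np.cumsum(starts_new) - 1
    # Scatter the labels back to the original positions.
    out = np.empty(len(values), dtype=np.uint32)
    out[order] = component_sorted
    return out

group_ids = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3])
values = np.array([0, 35, 20, 25, 30, 0, 5, 10, 20, 0, 5, 10, 15])
components = connect_adjacents_sketch(group_ids, values, max_gap=5)
print(components.tolist())  # [0, 1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4]
```

Because the labels are assigned in sorted order and then scattered back, the result does not depend on the order of the input, which matches the shuffled example above.
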
get_value_by_idx(idx, values, default, check_unique=True, minlength=None)

Given array of indexes idx and array values (unordered, not necessarily full), output array such that out[i] = values[idx==i].

If all indexes in idx are unique, it is equivalent to sorting the values by their idx and filling with default for missing idx.

If idx elements are not unique and you still want to proceed, you can set check_unique to False. The output values for the non-unique indexes will be chosen arbitrarily among the multiple values corresponding.

Parameters
  • idx (array) – (n,) uint array with values < max_idx

  • values (array) – (n,) dtype array

  • default (dtype) – filling value for output[i] if there is no idx == i

  • check_unique (bool) – if True, checks that the idx are unique. If False and the idx are not unique, an arbitrary value is chosen.

  • minlength (int?) – minimum shape for the output array (default: idx.max() + 1).

Returns array

(max_idx+1,), dtype array such that out[i] = values[idx==i].

Example

>>> idx = numpy.array([8,2,4,7])
>>> values = numpy.array([100, 200, 300, 400])
>>> get_value_by_idx(idx, values, -1, check_unique=False, minlength=None)
array([ -1,  -1, 200,  -1, 300,  -1,  -1, 400, 100])

Example with non-unique elements in idx:

>>> idx = numpy.array([2,2,4,7])
>>> values = numpy.array([100, 200, 300, 400])
>>> get_value_by_idx(idx, values, -1, check_unique=False, minlength=None)
array([ -1,  -1, 200,  -1, 300,  -1,  -1, 400])
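
The behaviour described above amounts to a scatter into an array pre-filled with default. The following sketch uses plain NumPy. get_value_by_idx_sketch is a hypothetical name, and raising ValueError when check_unique fails is an assumption, since the text does not say how the check reports duplicates.

```python
import numpy as np

def get_value_by_idx_sketch(idx, values, default, check_unique=True, minlength=None):
    """Scatter values to the positions given by idx, filling the rest with default."""
    if check_unique and len(np.unique(idx)) != len(idx):
        # Assumption: duplicates are reported with an exception.
        raise ValueError("idx contains duplicates")
    length = max(int(idx.max()) + 1, minlength or 0)
    out = np.full(length, default, dtype=values.dtype)
    # With duplicated idx, which value ends up in the slot is not guaranteed.
    out[idx] = values
    return out

idx = np.array([8, 2, 4, 7])
values = np.array([100, 200, 300, 400])
scattered = get_value_by_idx_sketch(idx, values, -1)
print(scattered.tolist())  # [-1, -1, 200, -1, 300, -1, -1, 400, 100]
```
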
get_most_common_by_idx(idx, values, fill, minlength=None)

Given array of indexes idx and array values, outputs the most common value by idx.

Parameters
  • idx (array) – (n,) uint array with values < max_idx

  • values (array) – (n,) non-float, dtype array

  • fill – filling value for output[i] if there is no idx == i

  • minlength – minimum shape for the output array.

Returns

(max_idx+1,), dtype array such that out[i] = the most common value in values[idx==i]
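
No example is given for this function, so here is a small sketch of the idea with plain NumPy, restricted to integer values. most_common_by_idx_sketch is a hypothetical name, and the way ties are resolved here (the largest tied value wins) is an arbitrary choice of the sketch, not documented behaviour.

```python
import numpy as np

def most_common_by_idx_sketch(idx, values, fill, minlength=None):
    """Most frequent integer value for each idx; slots with no value get fill."""
    length = max(int(idx.max()) + 1, minlength or 0)
    # Count every distinct (idx, value) pair.
    pairs, counts = np.unique(np.stack([idx, values], axis=1),
                              axis=0, return_counts=True)
    # Order the pairs by idx, then by count: the last pair of each idx is the most frequent.
    order = np.lexsort((counts, pairs[:, 0]))
    pair_idx = pairs[order, 0]
    pair_values = pairs[order, 1]
    is_last = np.r_[pair_idx[1:] != pair_idx[:-1], True]
    out = np.full(length, fill, dtype=values.dtype)
    out[pair_idx[is_last]] = pair_values[is_last]
    return out

idx = np.array([0, 0, 0, 1, 1, 3])
values = np.array([7, 7, 2, 5, 5, 9])
most_common = most_common_by_idx_sketch(idx, values, fill=-1)
print(most_common.tolist())  # [7, 5, -1, 9]
```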

average_by_idx(idx, values, weights=None, minlength=None, fill=0, dtype='float64')

Compute average-by-idx given array of indexes idx, values, and optional weights

Parameters
  • idx (array) – (n,) int array

  • values (array) – (n,) float array

  • weights (array?) – (n,) float array

  • minlength (int?) – (default: idx.max() + 1)

  • fill (float?) – filling value for missing idx (default: 0)

  • dtype (str?) – (default: 'float64')

Returns

(min-length,) float array, such that out[i] = mean(values[idx==i])

Example

>>> idx  = numpy.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
>>> values   = numpy.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
>>> average_by_idx(idx, values, fill=0)
array([ 2.        ,  3.        , 0.        ,  3.33333333])
>>> weights = numpy.array([0, 1, 0, 0, 0, 1, 2, 3, 4, 1, 1, 0])
>>> average_by_idx(idx, values, weights=weights, fill=0)
array([ 1.,  4., 0.,  2.])
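
The weighted average per index can be sketched with two calls to numpy.bincount, one for the weighted sums and one for the total weights. This is an illustration, not the library's implementation: average_by_idx_sketch is a hypothetical name, and using fill for groups whose total weight is zero is an assumption.

```python
import numpy as np

def average_by_idx_sketch(idx, values, weights=None, minlength=None,
                          fill=0, dtype="float64"):
    """Weighted mean of values per idx; slots without weight get fill."""
    length = max(int(idx.max()) + 1, minlength or 0)
    if weights is None:
        weights = np.ones(len(values))
    weighted_sum = np.bincount(idx, weights=weights * values, minlength=length)
    weight_total = np.bincount(idx, weights=weights, minlength=length)
    out = np.full(length, fill, dtype=dtype)
    # Divide only where there is something to average, to avoid 0 / 0.
    has_weight = weight_total > 0
    out[has_weight] = weighted_sum[has_weight] / weight_total[has_weight]
    return out

idx = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3])
values = np.array([0, 1, 2, 3, 4, 0, 2, 4, 6, 0, 4, 6])
weights = np.array([0, 1, 0, 0, 0, 1, 2, 3, 4, 1, 1, 0])
plain_mean = average_by_idx_sketch(idx, values)
weighted_mean = average_by_idx_sketch(idx, values, weights=weights)
print(weighted_mean.tolist())  # [1.0, 4.0, 0.0, 2.0]
```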