## Abstract

Histogram construction or sequence segmentation is a basic task with applications in database systems, information retrieval, and knowledge management. Its aim is to approximate a sequence by line segments. Unfortunately, the quadratic algorithm that derives an optimal histogram for Euclidean error lacks the desired scalability. Therefore, sophisticated approximation algorithms have been recently proposed, while several simple heuristics are used in practice. Still, these solutions fail to resolve the efficiency-quality tradeoffs in a satisfactory manner. In this paper we take a fresh view on the problem. We propose conceptually clear and scalable algorithms that efficiently derive high-quality histograms. We experimentally demonstrate that existing approximation schemes fail to deliver the desired efficiency and conventional heuristics do not fare well on the side of quality. On the other hand, our schemes match or exceed the quality of the former and the efficiency of the latter.

## Background and Related Work

Given a finite data sequence D = d0, d1, ..., d(n-1), construct a histogram HB from D using at most B buckets that minimizes the total error over all buckets. The error of a bucket is defined as the sum of squared differences between each element in the bucket and the bucket's mean value. This problem can be solved optimally using Dynamic Programming (DP), but the running time of the optimal construction is O(n^2 * B), quadratic in n. Therefore, approximation algorithms such as AHistL and DnS have recently been proposed to obtain high-quality solutions efficiently.

AHistL decomposes the problem into two parts: an inner part that reports whether the error for a given parameter value lies within a "good" range, and an outer part that searches for the appropriate range and sets the parameter. The algorithm takes two parameters, delta and epsilon. Delta is adjusted over time by the outer part, while epsilon is fixed in advance: the smaller the epsilon, the denser the list generated by the inner part, the more accurate the result, and the slower the algorithm. The complexity of this algorithm is O(n + B^3 * (epsilon^-2 + log n) * log n). AHistL is not practical when B is large; as shown in our experiments, beyond a certain value of B it is better to switch to the optimal construction algorithm.

DnS (Divide and Segment) works by splitting the input sequence into X sub-partitions and segmenting each of them optimally into B partitions using DP. Each sub-partition produces B boundary positions (samples), yielding a total of X * B samples. The optimal DP segmentation is then run on the X * B samples to obtain the final B partitions. Choosing X = (n/B)^(2/3) yields the best complexity, O(n^(4/3) * B^(2/3)).
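To make the optimal construction concrete, here is a minimal Python sketch of the O(n^2 * B) dynamic program (the function and variable names are ours, for illustration only). Prefix sums of values and squared values let each bucket's SSE be evaluated in O(1):

```python
def build_prefix(d):
    """Prefix sums of values and squared values for O(1) bucket error."""
    s, sq = [0.0], [0.0]
    for x in d:
        s.append(s[-1] + x)
        sq.append(sq[-1] + x * x)
    return s, sq

def bucket_error(s, sq, i, j):
    """SSE of d[i:j] around its mean (requires j > i)."""
    total = s[j] - s[i]
    return (sq[j] - sq[i]) - total * total / (j - i)

def optimal_histogram(d, B):
    """Optimal B-bucket histogram of d: returns (total error, boundaries)."""
    n = len(d)
    s, sq = build_prefix(d)
    INF = float("inf")
    # E[b][j] = best error segmenting d[0:j] into b buckets
    E = [[INF] * (n + 1) for _ in range(B + 1)]
    back = [[0] * (n + 1) for _ in range(B + 1)]
    E[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                if E[b - 1][i] == INF:
                    continue
                e = E[b - 1][i] + bucket_error(s, sq, i, j)
                if e < E[b][j]:
                    E[b][j], back[b][j] = e, i
    # recover boundaries s0 = 0 < s1 < ... < sB = n
    bounds, j = [n], n
    for b in range(B, 0, -1):
        j = back[b][j]
        bounds.append(j)
    return E[B][n], bounds[::-1]
```

The triple loop over (b, j, i) is exactly where the quadratic cost in n comes from, which motivates the approximation schemes discussed above.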

## Our Contributions

We developed an algorithm that produces good-quality (low total error) segmentations as fast as possible. We first designed an effective and efficient greedy local search algorithm and used it to produce high-quality samples. The samples are then used to generate the final B partitions optimally using DP. We show experimentally that the number of samples collected is smaller than that of DnS, yet the samples have better quality, giving better segmentation results and a faster runtime than DnS. We also compared the time-versus-quality tradeoffs against several approximation algorithms. Although we cannot give a bound on the error, we demonstrate the efficiency and quality of our algorithms on several well-known datasets. The experimental results show that our algorithm matches or outperforms the approximation algorithms in both quality and speed, especially for large numbers of elements (n) and segments (B).

Our greedy-based local search algorithm (GDY) works by moving from one solution to another. A solution is a set of B+1 partition boundaries S = s0, s1, s2, ..., sB, where s0 = 0, sB = n, and s1, s2, ..., s(B-1) are the "movable" boundaries that separate the n elements into B buckets. GDY repeatedly moves from one solution to the next by picking the movable boundary whose removal yields the least increase in total error and re-inserting it between the pair of boundaries where it yields the greatest decrease in total error, until no improving move can be performed (a local optimum). GDY matches the performance of the one-dimensional variant of the MHIST heuristic while giving better quality on the set of datasets we tested.

We can perform multiple runs/iterations (I) of GDY with randomized initial boundaries to produce high-quality samples, which are then used to construct the final B partitions using DP. We call this algorithm GDY_(I)DP, where I is a small constant (e.g., 4-32). GDY_32DP outperforms DnS in runtime by an order of magnitude while matching its segmentation quality.

To cope with very large n (around 100,000, as in the synthetic dataset), we further scale the algorithm by performing the DP in batches over the samples. We create a batch for (roughly) every sqrt(n) consecutive (adjacent) samples. For each batch, we estimate how many partitions P it should receive by running GDY once and counting how many of its boundaries fall into the batch's range; we then run DP on the samples in the batch to place exactly P partitions, optimizing their locations inside the batch. The batched strategy (GDY_BDP) maintains a runtime linear in n and B while preserving segmentation quality on very large data sequences.
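The remove-cheapest/insert-best move described above can be sketched as follows. This is a minimal Python illustration under our own naming and with a naive O(n) scan for the best split position, not the tuned implementation used in the experiments:

```python
import random

def _prefix(d):
    """Prefix sums of values and squared values for O(1) bucket error."""
    s, sq = [0.0], [0.0]
    for x in d:
        s.append(s[-1] + x)
        sq.append(sq[-1] + x * x)
    return s, sq

def _err(s, sq, i, j):
    """SSE of d[i:j] around its mean."""
    if j <= i:
        return 0.0
    t = s[j] - s[i]
    return (sq[j] - sq[i]) - t * t / (j - i)

def gdy(d, B, seed=0):
    """One GDY run: returns boundaries [0, s1, ..., s(B-1), n]."""
    n = len(d)
    if B < 2:
        return [0, n]
    s, sq = _prefix(d)
    rng = random.Random(seed)
    # random, distinct initial movable boundaries; 0 and n are fixed
    bounds = [0] + sorted(rng.sample(range(1, n), B - 1)) + [n]

    def best_split(i, j):
        """Cheapest single split of bucket d[i:j]: (error delta, position)."""
        base = _err(s, sq, i, j)
        best_delta, best_k = 0.0, None
        for k in range(i + 1, j):
            delta = _err(s, sq, i, k) + _err(s, sq, k, j) - base
            if best_k is None or delta < best_delta:
                best_delta, best_k = delta, k
        return best_delta, best_k

    while True:
        # boundary whose removal increases total error the least
        rem_t, rem_cost = None, None
        for t in range(1, B):
            i, m, j = bounds[t - 1], bounds[t], bounds[t + 1]
            c = _err(s, sq, i, j) - _err(s, sq, i, m) - _err(s, sq, m, j)
            if rem_cost is None or c < rem_cost:
                rem_t, rem_cost = t, c
        trial = bounds[:rem_t] + bounds[rem_t + 1:]
        # position whose insertion decreases total error the most
        add_delta, add_pos = None, None
        for t in range(len(trial) - 1):
            delta, k = best_split(trial[t], trial[t + 1])
            if k is not None and (add_delta is None or delta < add_delta):
                add_delta, add_pos = delta, k
        if add_pos is None or rem_cost + add_delta >= -1e-9:
            return bounds  # no improving move: local optimum reached
        bounds = sorted(trial + [add_pos])
```

Each accepted move strictly decreases the total error, so the loop terminates; the boundaries visited along the way are the "samples" that GDY_(I)DP collects across its I randomized runs.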

### Greedy Runs Produce Good Samples

A single GDY run produces a set of boundary positions (samples). Surprisingly, more than 50% of these boundaries belong to the global optimum partitioning. Running the randomized greedy algorithm multiple times improves the percentage of samples that belong to the optimal partitions. DP can then pick the final best B partitions out of the samples, yielding a solution that is very close to optimal. Furthermore, by modifying the DP to report the number of distinct optimal solutions, we examined the datasets we use and discovered that for most values of B there is only one global optimum. Our GDY sampling correctly places samples near the global optimum positions.
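The final step, picking the best B partitions out of the pooled samples, is an ordinary segmentation DP restricted to the candidate positions. A minimal Python sketch (our own names; the sample pool can come from any source), whose cost depends on the number of samples rather than on n:

```python
def _prefix(d):
    """Prefix sums of values and squared values for O(1) bucket error."""
    s, sq = [0.0], [0.0]
    for x in d:
        s.append(s[-1] + x)
        sq.append(sq[-1] + x * x)
    return s, sq

def _err(s, sq, i, j):
    """SSE of d[i:j] around its mean."""
    t = s[j] - s[i]
    return (sq[j] - sq[i]) - t * t / (j - i)

def segment_on_samples(d, B, samples):
    """Best B-bucket segmentation whose boundaries are drawn from the
    candidate positions in `samples` (0 and n are always allowed)."""
    n = len(d)
    s, sq = _prefix(d)
    pos = [0] + sorted({p for p in samples if 0 < p < n}) + [n]
    m = len(pos)
    INF = float("inf")
    # E[b][j] = best error covering d[0:pos[j]] with b buckets
    E = [[INF] * m for _ in range(B + 1)]
    back = [[0] * m for _ in range(B + 1)]
    E[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, m):
            for i in range(b - 1, j):
                if E[b - 1][i] == INF:
                    continue
                e = E[b - 1][i] + _err(s, sq, pos[i], pos[j])
                if e < E[b][j]:
                    E[b][j], back[b][j] = e, i
    # recover boundaries ending at pos[m-1] = n
    bounds, j = [n], m - 1
    for b in range(B, 0, -1):
        j = back[b][j]
        bounds.append(pos[j])
    return E[B][m - 1], bounds[::-1]
```

If the samples contain the true optimum boundaries, as our measurements suggest they usually do, this restricted DP recovers the globally optimal segmentation.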

Play with the tool below to see the samples generated by running 32 iterations of GDY. The red vertical lines mark the global optimum partitions. For comparison, we also include the uniform sampling used by DnS.

(Interactive demo: select a dataset and the number of segments.)

We can see that the number of samples obtained from running 32 GDY iterations (the green dots) is usually smaller than that of DnS (the red dots), and the samples are mostly very good: they lie close to the optimal segmentation positions. This shows that GDY produces better samples than DnS, both in quality and in number. Since DnS samples each of its X sub-partitions uniformly, consider the case where the first 90% of the input is a very steady sequence and the last 10% is a random sequence. If DnS splits the input into X = 10 sub-partitions, the first 9*B samples from the first 9 sub-partitions are useless (bad samples) and only the last B samples are good. Our greedy algorithm, in contrast, will most likely treat the first 90% of the sequence as a single bucket and sample more densely in the last 10% of the input. This illustrates why uniform sampling can be a poor sampling strategy.

## Datasets

```
balloon      2001 x 2   http://lib.stat.cmu.edu/datasets/
darwin       1400 x 1   http://www.stat.duke.edu/~mw/ts_data_sets.html
shuttle      1000 x 6   http://www-aig.jpl.nasa.gov/public/mls/time-series/
winding      2500 x 7   http://www.esat.kuleuven.ac.be/~tokka/daisydata.html
phone        1708 x 8   http://www.teco.edu/tea/datasets/phone1.xls
exrates      2567 x 12  http://www.stat.duke.edu/data-sets/mw/ts_data/all_exrates.html
djdc0093    26612 x 2   http://lib.stat.cmu.edu/datasets/djdc0093
djia16K     16384 x 1   http://lib.stat.cmu.edu/datasets/djdc0093 (filtered)
synthetic  100001 x 10  http://kdd.ics.uci.edu/databases/synthetic/synthetic.html
```

## Comparison Tools

(Interactive comparison tool: select a dataset and the number of buckets B; n is the length of the selected dataset.)