So far, it's not terrible, but has some pretty big flaws.
Especially with time delays and 3+ attractors this can be problematic.
A simple example:
https://doi.org/10.21203/rs.3.rs-1088857/v1
Tools to detect these features have been developed over the past few decades, and I know I wasted a few years on a project that superficially looked like an FP issue but ended up being a mix of the Wada property and/or porous sets.
The complications of describing these indeterminate situations, which are worse than traditional chaos, may make it inappropriate for you.
But it would be nice if visibility were increased. Funnily enough, most of what is in LLM corpora on this seems to come from an LSAT question.
There has been a lot of movement here when you have n>=3 attractors/exits.
Not solutions unfortunately, but tools to help figure out when you hit it.
The biggest thing I've run into in my testing is that an anomaly over a reasonably short timeframe seems to throw the upper and lower bands off for quite some time.
That being said, perhaps changing some of the variables would help with that; I just don't have the skill to work out exactly how to adjust them.
Could you describe your use case around "exportable to any code" a bit more?
I'm not really smart in these areas, but it feels like forecasting and anomaly detection are pretty related. I could be wrong though.
For years, I'd been casually observing the daily numbers (2 draws daily for each since around ?/?/2004?, and 1 prior), which are Pick2, 3, 4, and 5, but mostly Pick4, which is 4 digits, thus has 1:10,000 odds, vs 1:100, 1:1000 and 1:100,000 for the others.
With truly random numbers, it is pretty difficult to identify anything but glaring anomalies. Among the tests performed were: clusters (daily/weekly); isolation forest; popular permutations (by date, special holidays, etc.); individual digit deviations; temporal frequency; DBSCAN; z-score; patterns; correlation; external factors; autocorrelation by cluster; predictive modeling; chi-squared; time series ... and a few more I've forgotten.
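Since chi-squared and individual-digit deviations are on that list, here is a minimal sketch of what the digit-frequency version of that test could look like (the draws list and the loading step are hypothetical; scipy's chisquare defaults to a uniform expected distribution):

    import numpy as np
    from scipy.stats import chisquare

    # Hypothetical input: the full Pick 4 history as 4-digit strings
    draws = ["3004", "7191", "0042"]  # ...replace with the ~60k-draw history

    # Count how often each digit 0-9 appears across all positions
    digits = np.array([int(ch) for num in draws for ch in num])
    observed = np.bincount(digits, minlength=10)

    # Under a fair draw every digit should be roughly equally likely;
    # chisquare tests the observed counts against that uniform expectation
    stat, p = chisquare(observed)
    print(observed, stat, p)

A small p-value would suggest the digit frequencies deviate from uniform more than chance alone would explain; with 60k+ draws even small biases should show up.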
For those wondering why I'd do this, around 2023-23, the FL Lottery drastically modified their website. Previously, one could enter a number for the game of their choice and receive all historical permutations of that number over all years, going back to the 1990s. With the new modification, the permutations have been eliminated and the history only shows for 2 years. The only option for the complete history is to download the provided PDF -- however, it is full of extraneous characters and cannot be readily searched via Ctrl-F, etc. Processing this PDF involves extensive character removal to render it parsable or modestly readable. So to restore the previously functional search ability, manual work is required. The seemingly deliberate obfuscation, or obstruction, was my motivation. The perceived anomalies over the years were secondary, as I am capable of little more than speculation without proper testing. But those two factors intrigued me.
Having no background in math and only feeble abilities in programming, this was a task I could not have performed without LLMs and the Python code used for the tests. The testing is still incomplete, having increased in complexity as I progressed and left me too tired to persist. The results were ultimately within acceptable ranges of randomness, but some patterns were present. I had made files of: all numbers that have ever occurred; all numbers that have never occurred; popularity of single, isolated digits -- I was actually correct in my intuition here, which showed certain single digits occurring with lesser or greater frequency than would be expected; and a script to apply Optical Character Recognition to the website and append the latest results to a living text and PDF file, to offer anyone interested an opportunity to freely search, parse and analyze the numbers. But I couldn't quite wangle the OCR successfully.
Working with a set of over 60k individual number draws, looking for anomalies over a 30-year period; if there are other methods anyone would suggest, please offer them and I might resume this abandoned project.
Something like: ghostpdf to convert the PDF into images, then GPT-4o or Ollama + Llama 3 to transcribe each image into JSON output.
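A rough sketch of that pipeline, assuming Ghostscript's gs CLI for the rendering and the OpenAI Python client for GPT-4o (the file names, the prompt, and the JSON field names are all made up here, and the model's output would still need validation):

    import base64
    import subprocess
    from openai import OpenAI

    # Render each PDF page to a PNG (page-001.png, page-002.png, ...)
    subprocess.run(
        ["gs", "-sDEVICE=png16m", "-r200", "-o", "page-%03d.png", "history.pdf"],
        check=True,
    )

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    with open("page-001.png", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe every draw on this page as a JSON array of "
                         "{date, draw: midday|evening, winning_number, fireball}."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)

Looping that over every page and concatenating the JSON would give a machine-readable history, with spot checks against the original PDF to catch transcription errors.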
The PDF is thousands of lines, with multiple entries on each, convoluted with lots of garbage formatting. The only data to be preserved are the winning number, date, evening/midday draw, and fireball number (introduced in 2020-ish?), with the change from one to two daily draws (2001-ish) factored in as an acceptable anomaly in the data set -- which has been done, I believe.
The difficult part of this, actually, was cleaning the squalid PDF.
In the end, after the work was all done, the script using OCR would successfully append a number to the cleaned text/PDF, but usually not the correct number. The only reason I used OCR was that I couldn't find the right frames in the webpage that contained the latest winning numbers, and getting HTML extraction to work in a script failed because of that.
I must admit, although I have used JSON files, I don't know much about them. Additionally, I'm ignorant enough that it's probably best not to attempt to advise me too much here for sake of thread sanitation - it could get bloated and off topic :)
I think with renewed inspiration I could figure out a successful method to keep the public file updated, but I primarily need surefire methods of analysis on the numbers for my anomaly detection, which is a challenge for a caveman who never went to middle/high school and didn't resume school beyond 4th grade until community college much later. Of course, the fact that such an animal can twiddle with statistics and data analysis is a big testament to the positive attributes of LLMs, without which the pursuit would be a vague thought at most.
Although I welcome and appreciate any feedback, I'm pretty sure it isn't too welcome here. I'll try to make sense of your suggestions though.
https://apim-website-prod-eastus.azure-api.net/drawgamesapp/...
Gets you Pick 4 for 6 Jan; easy to parse.
.... "FireballPayouts": 18060.5, "DrawNumbers": [ { "NumberPick": 3, "NumberType": "wn1" }, { "NumberPick": 0, "NumberType": "wn2" }, { "NumberPick": 0, "NumberType": "wn3" }, { "NumberPick": 4, "NumberType": "wn4" }, { "NumberPick": 1, "NumberType": "fb" } ...
This is critical; and difficult to deal with in many instances.
> With the term anomalies we refer to data points or groups of data points that do not conform to some notion of normality or an expected behavior based on previously observed data.
This is a key problem or perhaps the problem: rigorously or precisely defining what an anomaly is and is not.
We built a system to detect exceptions before they happened and act on them, hoping that this would be better than letting them happen (e.g. preemptively slowing down the rate of requests instead of letting the database become exhausted).
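Nothing above says how that system was built, but a toy version of the idea (forecast load, then throttle before the limit is hit rather than after errors appear) might look like this; the naive extrapolation and the capacity number are placeholders:

    # Toy preemptive throttle: act on a forecast instead of waiting for failures
    DB_SAFE_QPS = 800.0  # assumed safe query rate for the database

    def forecast_next_qps(recent_qps: list[float]) -> float:
        # Stand-in forecaster: naive linear extrapolation of the last two samples
        if len(recent_qps) < 2:
            return recent_qps[-1]
        return recent_qps[-1] + (recent_qps[-1] - recent_qps[-2])

    def admit_fraction(recent_qps: list[float]) -> float:
        # If the forecast crosses the safe limit, admit proportionally less traffic now
        predicted = forecast_next_qps(recent_qps)
        return 1.0 if predicted <= DB_SAFE_QPS else DB_SAFE_QPS / predicted

    print(admit_fraction([600.0, 750.0]))  # forecast 900 qps -> admit ~0.89 of requests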
At the time, I felt that there was soooooooo much to do in the area, and I'm kinda sad I never worked on it again.
There is a similar algorithm with a simpler implementation in this paper: "GraphTS: Graph-represented time series for subsequence anomaly detection" https://pmc.ncbi.nlm.nih.gov/articles/PMC10431630/
The approach is for univariate time series and I found it to perform well (with very minor tweaks).
For example, some former colleagues' time series foundation model (Granite TS), which was doing pretty well when we were experimenting with it. [1]
An aha moment for me was realizing that the way you can think of anomaly models working is that they're effectively forecasting the next N steps, and then noticing when the actual measured values are “different enough” from the expected. This is simple to draw on a whiteboard for one signal, but when it's multivariate it's pretty neat that it works.
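A minimal single-signal sketch of that framing, with a trailing rolling mean standing in for a real forecaster (the window and the 3-sigma threshold are arbitrary choices, not anything from the comment above):

    import numpy as np
    import pandas as pd

    def flag_anomalies(series: pd.Series, window: int = 48, k: float = 3.0) -> pd.Series:
        # "Forecast" each point from the trailing window, then flag it when the
        # residual exceeds k rolling standard deviations of past behavior
        forecast = series.shift(1).rolling(window).mean()
        spread = series.shift(1).rolling(window).std()
        residual = (series - forecast).abs()
        return residual > k * spread

    # Example: a steady signal with one injected spike at index 300
    ts = pd.Series(np.random.default_rng(0).normal(10, 1, 500))
    ts.iloc[300] += 15
    print(ts[flag_anomalies(ts)].index.tolist())  # should contain 300

Swapping the rolling mean for a proper forecaster (ARIMA, a foundation model like TTM, etc.) keeps the same structure: forecast, compare, threshold the residual.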
[1] https://huggingface.co/ibm-granite/granite-timeseries-ttm-r1
> TTM-1 currently supports 2 modes:
> Zeroshot forecasting: Directly apply the pre-trained model on your target data to get an initial forecast (with no training).
> Finetuned forecasting: Finetune the pre-trained model with a subset of your target data to further improve the forecast
My naive view was that there was some sort of “normalization” or “pattern matching” that was happening. Like - you can look at a trend line that generally has some shape, and notice when something changes or there’s a discontinuity. That’s a very simplistic view - but - I assumed that stuff was trying to do regressions and notice when something was out of a statistical norm like k-means analysis. Which works, sort of, but is difficult to generalize.
What you describe here is effectively forecasting what is expected to happen and then noticing a deviation from it.
[0] https://scikit-learn.org/stable/modules/generated/sklearn.en...
* Manufacturing: Computer vision to pick anomalies off the assembly line.
* Operation: Accelerometers/temperature sensors w/ frequency analysis to detect the onset of faults (prognostics / diagnostics) and do predictive maintenance (toy sketch after this list).
* Sales: Timeseries analyses on numbers / support calls to detect up/downticks in cashflows, customer satisfaction etc.
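A toy sketch of the frequency-analysis idea from the second bullet; the sample rate, drive frequency, and fault harmonic are all invented, the point is just "transform to the frequency domain and watch for new peaks outside the healthy spectrum":

    import numpy as np

    fs = 10_000  # assumed sample rate, Hz
    t = np.arange(0, 1, 1 / fs)

    # Synthetic accelerometer trace: 50 Hz drive line plus a small 370 Hz fault harmonic
    rng = np.random.default_rng(0)
    signal = (np.sin(2 * np.pi * 50 * t)
              + 0.05 * np.sin(2 * np.pi * 370 * t)
              + rng.normal(0, 0.01, t.size))

    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, 1 / fs)

    # Ignore the known drive frequency and report the strongest remaining component
    mask = freqs > 100
    peak_hz = freqs[mask][np.argmax(spectrum[mask])]
    print("strongest non-drive component near", peak_hz, "Hz")

In practice you'd compare band energies against a baseline recorded from a healthy machine rather than just picking the single largest peak.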
The Matrix Profile is honestly one of the most underrated tools in the time series analysis space - it's ridiculously efficient. The killer feature is how it just works for finding motifs and anomalies without having to mess around with window sizes and thresholds like you do with traditional techniques. Solid across domains too, from manufacturing sensor data to ECG analysis to earthquake detection.
[a] Matrix Profile XXX: MADRID: A Hyper-Anytime Algorithm to Find Time Series Anomalies of all Lengths. Yue Lu, Thirumalai Vinjamoor Akhil Srinivas, Takaaki Nakamura, Makoto Imamura, and Eamonn Keogh. ICDM 2023.
This is a better presentation by the same folks. https://matrixprofile.org/
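A small illustration using stumpy, one open-source Matrix Profile implementation (not mentioned above, and you do still pick a subsequence length m): the discord, i.e. the most anomalous subsequence, is simply where the profile peaks.

    import numpy as np
    import stumpy

    # Synthetic series with a planted anomaly around index 1200
    rng = np.random.default_rng(0)
    ts = np.sin(np.linspace(0, 50, 2000)) + rng.normal(0, 0.1, 2000)
    ts[1200:1230] += 2.0

    m = 100  # subsequence length
    mp = stumpy.stump(ts, m)

    # Column 0 of the result is the matrix profile; its maximum marks the discord
    discord_idx = int(np.argmax(mp[:, 0].astype(float)))
    print("anomaly starts near index", discord_idx)

MADRID (referenced above) goes further by searching over all subsequence lengths at once, which removes even that one parameter.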
I get that they could be explicitly modeling a data-generating process's probability itself (just like a NN), like a Bernoulli (whose maximum-likelihood loss is cross-entropy) or a Normal (whose maximum-likelihood loss is mean squared error), but I don't think that is what the author meant by a Distribution.
My understanding is that they don't make a distributional assumption on the random variable (your Y or X) they are trying to find a max margin for.
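For the parenthetical in the comment above it, a quick note on why those pairings hold (standard maximum-likelihood identities, not something from the article): the negative log-likelihoods reduce to the familiar losses.

    -\log p(y \mid \hat{p}) = -\big[\, y \log \hat{p} + (1 - y)\log(1 - \hat{p}) \,\big]
      \qquad \text{(Bernoulli, } y \in \{0,1\}\text{: cross-entropy)}

    -\log p(y \mid \mu, \sigma^{2}) = \frac{(y - \mu)^{2}}{2\sigma^{2}} + \tfrac{1}{2}\log(2\pi\sigma^{2})
      \qquad \text{(Gaussian, fixed } \sigma\text{: squared error up to constants)}

So fitting by maximum likelihood under a Bernoulli or fixed-variance Gaussian is exactly minimizing cross-entropy or mean squared error, which is a separate question from whether the max-margin objective itself assumes a distribution.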