Help! for ilo Muni
This page is about how to use and understand ilo Muni, which you can work through like a tutorial or a manual. If you want to know how or why ilo Muni exists, or want to talk to me, see the about page!
There’s also a cheatsheet for using ilo Muni on the main page, under the help button.
Search
Words
Search for a toki pona word such as pona. If it appeared at least 40 times across all the places and times I checked, you’ll get a graph showing how that word has been used over time! By default, you’ll see what percentage of all words were the searched word with monthly datapoints.
You can also search for proper names like Sonko, Inli, or Siko. There are lots of very rare words in the database too. You might be surprised by what you find, so try lots of things!
Phrases
You can search for up to 6 words in one phrase, such as mi kama sona e toki pona. Note that the more words a phrase has, the less often it is likely to appear, so don’t be surprised if you don’t get a result. Try searching for shorter phrases, like kama pona or anu seme.
Examples
- Watch the rising popularity of phrases like kin la
- Watch phrases come and go over time as the things they reference do, like tenpo pana, tenpo monsuta, or suno pi toki pona
- Examine the use of grammatical features like kepeken e
Multiple Searches
You can graph multiple phrases at once by separating them with commas (,), even mixing words and phrases: toki, pona, toki pona. Often, different words or phrases will have very different amounts of use, resulting in it being hard to read more than the top one or two graphs. You can address that with alternate scales later in this tutorial.
Examples
- Compare former synonyms like lukin, oko
- See the close relationship between similar words like pan, kili or laso, loje, jelo, walo, pimeja
- Compare the modifiers that appear after specific words, like wawa a, wawa mute, wawa lili, wawa suli, wawa sewi
- Examine how people talk about their skill in toki pona with sona toki pona, sona e toki pona, sona pi toki pona
- Compare greetings over time, like sina seme, sina pali e seme
- See how popular Sonja’s books are with pu, ku, su
Wildcard
You can search for multiple phrases of the same form with a wildcard by replacing one word other than the first in a phrase with *. This will search for the ten most popular phrases that match your search and graph them.
Adding phrases
You can add two or more phrases together by putting a + between them. This is helpful for combining synonyms, like ale + ali. You can also use it to compare multiple related words to another one, like pan + kili, moku.
It often helps to graph some or all of the summed phrases alongside the independent phrases so you can see how much the graph has shifted, such as in pan + kili, moku, pan, kili.
Examples
- See what portion of toki pona is pure particles: li, e, la, pi, o, en, anu
- Combine multiple ways to write the same word, like ala + x or anu + y, anu, y
- Do that with UCSUR text specifically: toki +
- Combine related words to compare to others: a + n, pona
Subtracting phrases
You can subtract one or more phrases from another by putting a - between them. This is helpful when you’d like to omit a specific use of a word, such as in toki - toki pona. This gives you all the uses of the word “toki” which are not in the phrase “toki pona”.
Examples
- Determine what grammatical positions a phrase is most common in: tenpo ni - tenpo ni la - lon tenpo ni, tenpo ni
- Examine the use of a word without other conflicting terms, like in san - kekan san
Set Minimum Sentence Length
You can set a minimum sentence length for one term by adding an underscore (_) and a number from 1 to 6 to the end of that term. For example, toki_1, toki_6 will show you the percentage of times toki appeared in any sentence, versus the smaller percentage of times that it was specifically in sentences with at least 6 words. You can do this with phrases too: ona li, ona li_6.
You can use this with subtraction to isolate a word or phrase: The search toki - toki_2 will show you every time “toki” appeared, minus the times it appeared in sentences with 2 or more words- which means you have all the times “toki” was the only word in the sentence. Or, put another way, all the times toki meant “hello”! (except for the times it was an answer to a question, of course!)
Read ahead to the options section on Minimum Sentence Length for more details!
Examples
- Get even more accurate information about greetings: sina seme - sina seme_3, sina pali e seme - sina pali e seme_5
- See if there is a difference in relative use of words: toki, pona, toki_6, pona_6. (there is! “pona” is more common in short sentences; “toki” is more common in long sentences!)
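To make the search syntax above concrete, here is a minimal sketch of how a search string could be broken apart into graphs and signed terms. This is not ilo Muni’s actual parser: the `Term` class and `parse_query` function are hypothetical names, wildcards are simply kept as a `*` word, and no validation of the 1 to 6 range is done.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Term:
    sign: int               # +1 if the phrase is added, -1 if it is subtracted
    words: List[str]        # the words of the phrase; "*" would be a wildcard
    min_len: Optional[int]  # the number after the underscore, if there is one


def parse_query(query: str) -> List[List[Term]]:
    """Split a search into graphs (commas), then into signed terms (+ and -)."""
    graphs = []
    for part in query.split(","):
        terms = []
        sign = 1
        # The capture group keeps the + and - operators in the token list.
        for token in re.split(r"\s*([+-])\s*", part.strip()):
            if token == "+":
                sign = 1
            elif token == "-":
                sign = -1
            elif token:
                # Peel an optional _N suffix off the end of the term.
                match = re.fullmatch(r"(.+?)(?:_(\d+))?", token)
                min_len = int(match.group(2)) if match.group(2) else None
                terms.append(Term(sign, match.group(1).split(), min_len))
        if terms:
            graphs.append(terms)
    return graphs


for graph in parse_query("toki - toki_2, ona li_6"):
    print(graph)
```

Each comma-separated part becomes one line on the graph, and the terms inside it are summed with their signs.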
Options
Note: Scale has its own section, because it’s much more complicated than the other options.
Minimum Sentence Length
By default, All sentences is set, meaning you will see how words or phrases appear in any length of sentence. If you set this option to 3+ words per sentence, you’ll see how words or phrases appear in sentences which have at least 3 words. This can be helpful if you want to study more “substantial” uses of words, i.e. those that appear in longer sentences.
If one of your searches sets the minimum sentence length for a term, pay attention to the legend: If the legend shows the term without an underscore, it means the length you chose was already being searched for anyway. This can happen with a search like toki pona_2 normally, or wawa_3 while the minimum sentence length is set to 3.
This happens to the phrase toki pona with a minimum sentence length of 2 because phrases have a minimum sentence length equal to how many words are in them. That is, “toki pona” can only appear in sentences with at least two words- which I hope makes sense! Because of this, the minimum sentence length is always implicitly set to at least the length of the phrase. You can always set it to be higher, of course.
Note that when graphing on any relative scale, the percentage is derived by dividing the number of occurrences for the search term by the total number of words in the same time period.
If you’re curious why this is done instead of dividing by the number of words in sentences of the appropriate minimum length, here’s a simplified example:
Demonstration
Imagine the following scenario:
- There are 10,000 words
- 9,000 of those words are in sentences with at least 2 words
- 100 of the words are toki
- 90 of those toki are in sentences with at least 2 words
With this data, asking “What percentage of words are toki?” means we get 1% (100 / 10,000), which makes sense, and this is the only reasonable question to measure toki on its own.
However, there are two ways to measure toki in sentences with at least 2 words (toki_2): You could measure it as a portion of all words, or as a portion of words from sentences with at least 2 words. These two different choices have different results: toki_2 is 0.9% (90 / 10,000) of all words, but 1% (90 / 9,000) of the words in sentences with at least 2 words.
We can use this information to determine which answer is best for this graphing tool by exploring what happens when you search toki - toki_2.
In the sample data, graphing toki - toki_2 with toki_2 measured as a portion of all words means we get 0.1%: ((100 - 90) / 10,000). That is, toki appears exactly 10 times on its own in this data. Graphing instead with toki_2 measured as a portion of words in sentences with 2 or more words means we get 0%: ((100 / 10,000) - (90 / 9,000)). In other words, we get a misleading outcome because we’re subtracting two incomparable percentages.
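The arithmetic above can be checked with a few lines of Python. This is only a worked version of the sample scenario; the variable names are made up for illustration and the numbers are the invented ones from the list, not real data.

```python
total_words = 10_000      # all words in the time period
words_in_len_2 = 9_000    # words in sentences with at least 2 words
toki = 100                # occurrences of "toki"
toki_2 = 90               # occurrences of "toki" in sentences with at least 2 words

# Both terms measured against all words: the counts subtract cleanly.
same_denominator = (toki - toki_2) / total_words * 100

# toki_2 measured against its own subset: two incomparable percentages.
mixed_denominators = (toki / total_words - toki_2 / words_in_len_2) * 100

print(f"shared denominator: {same_denominator:.2f}%")
print(f"mixed denominators: {mixed_denominators:.2f}%")
```

With a shared denominator the result still corresponds to the 10 standalone uses of toki; with mixed denominators those 10 uses vanish.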
There is value in knowing what portion of sentences with at least 2 words are some specific word. This graphing tool does not offer that information because doing so would produce misleading graphs for both side-by-side comparison and for adding or subtracting specific results. If you’re interested in that alternate data, download the database!
Smoother
By default, Window Avg is set, and I don’t recommend changing it from the default unless you’re aware of what change you’re making and why.
Smoothing
By default, 2 smoothing is set. The number is how many neighbors on both sides of a given data point will be smoothed. For example, if you set 5 smoothing with the Window Avg smoother, it means a given point will be set to the average of the 5 points before, 5 points after, and itself.
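As a rough sketch of what a window average does, the function below averages each point with its neighbors on both sides. The name `window_average` is hypothetical, and the handling of the graph’s edges (shrinking the window when there are fewer neighbors) is an assumption; ilo Muni may treat edges differently.

```python
def window_average(values, smoothing):
    """Average each point with up to `smoothing` neighbors on each side."""
    smoothed = []
    for i in range(len(values)):
        # Assumption: near the edges, use whatever neighbors exist.
        start = max(0, i - smoothing)
        end = min(len(values), i + smoothing + 1)
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single spike gets spread over the months next to it.
print(window_average([0, 0, 9, 0, 0], 1))
```

Note how the spike leaks into months where the original value was zero; that is the smearing effect to keep in mind when reading smoothed graphs.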
Smoothing is helpful for making noisy graphs more readable while preserving the trend line of the original graph. Compare the graphs of wawa, nasa, suwi, sewi, suli with 0 smoothing and 5 smoothing.
Note that smoothing can produce misleading graphs with respect to the time axis, such as smearing periodic phrases over too much time or implying that misikeke was used before November 2019 with specific smoothers. Sometimes, 0 smoothing is better!
Some scales will have smoothing disabled, usually because it wouldn’t make sense to average their values. This applies to the absolute scale, for example, because it is meant to show you the exact number of times a given word or phrase appeared! This also applies to both offered derivatives, because they are completely impervious to localized averaging.
Dates
By default, the date range for the graph is set from August 2016 to August 2024. You can select any start or end you want, but there are some caveats to warn you about:
First, the graph ends in July 2024. This is because I collected this data during August, so it is incomplete after July. For that matter, it is differently incomplete depending on the platform and specific community, because I cannot collect them all simultaneously. The August data is in the database, but it produces misleading graphs to include, so I have omitted it from display.
Second, the default start date is August 2016 because the data prior to that is extremely sparse. I have left in the option to query for that data, but understand that the relative graphs will be noisy, the absolute graphs will be flat, minmax graphs will become nonsense- and for the other graphs, here be dragons.
Lastly, it is important to note that the way I store the data is a potential source of bias. If you’re curious, read the following:
Discussion
In the database, I count words in monthly “buckets.” You can see this on the graph, where each point aligns with a specific month and is labeled as such. This is a straightforward way to graph and read the data, but it has some disadvantages for interpretability.
Some months are shorter than others, so the absolute scale may be misleading for those periods, implying they were less active than neighboring months. Similarly, weeks are not evenly distributed over months, so some months will have more weekends and therefore more active periods than others. Fortunately, this doesn’t change the relative scale, since that is measured as a percentage of words said.
Still, in the future I would like to change the size of the time “buckets” to be 4-weekly, to reduce or eliminate the above described biases.
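To illustrate the month-length effect described above, here is a small sketch using Python’s standard `calendar` module. It is not something ilo Muni does; the helper names `per_day` and `weekend_days` and the word counts are invented for the example.

```python
import calendar


def per_day(count, year, month):
    """Turn a monthly count into an average per day of that month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return count / days_in_month


def weekend_days(year, month):
    """Count the Saturdays and Sundays that fall in a month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return sum(
        1
        for day in range(1, days_in_month + 1)
        if calendar.weekday(year, month, day) >= 5  # 5 = Saturday, 6 = Sunday
    )


# Invented counts: February looks quieter than March on an absolute scale,
# but the per-day activity is the same.
print(per_day(2800, 2023, 2), per_day(3100, 2023, 3))
print(weekend_days(2023, 2), weekend_days(2023, 7))
```

Equal-length buckets, such as the 4-weekly ones mentioned above, would make this kind of correction unnecessary.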
Scales
Absolute
The Absolute scale shows the exact number of times a given word or phrase was said in each time period. This is useful for observing trends in the activity of the community, but can make it difficult to compare words which are of very different magnitude, or to study the use of toki pona before March 2020.
If you’re interested in comparing two or more absolute graphs, but one word is vastly more popular than the others, check again using the Absolute Minmax scale.
Relative
The Relative scale is the default scale, and shows you what percentage of all words are the searched term in each time period on the graph. This also applies to phrases, which you would interpret as follows: “What portion of all words are any one word from this phrase?”
The justification for doing the math this way is similar to the one for minimum sentence length, and you can read about it below. Spoiler: I dunk on Google.
Discussion
As with minimum sentence length, there is more than one meaningful question that could be asked about phrases. In this case, there are three:
What percentage of all same-length phrases are this phrase?
This method has the same problem of incomparable percentages as seen in minimum sentence lengths: There isn’t a way to compare “toki” and “toki pona” if their percentages are measured against unrelated totals- and “percentage of all words” is not comparable to “percentage of all bigrams”. Strangely, Google Ngrams does this intentionally according to their info page, in spite of the fact that they allow you to add and subtract n-grams of different length. Direct quote, with bold added by me:
What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are “nursery school” or “child care”?
What percentage of all words are this phrase?
For graphing a single bi-gram, you can get interesting results from this question. To do this, we would multiply the number of times the phrase occurred by the length of the phrase, then divide by the total number of words. This means that, percentage-wise, each phrase counts for all of its words.
This is a problem, which we can see with a search such as toki - toki pona. This math implies we would remove some occurrences of “pona” from occurrences of “toki,” because the phrase “toki pona” accounts for both of its words. There were never any occurrences of “pona” in the occurrences of “toki,” so we would get a lower result than expected.
What percentage of all words are any word in this phrase?
In this method, the search toki - toki pona means to remove all occurrences of “toki pona” from “toki”. Since the phrase “toki pona” is only counted for how many times it appeared, and not for its number of words, the math works out: we are only removing the occurrences of any one word in the phrase “toki pona” from the word “toki”. This method treats phrases as a form of context, so you could interpret that search as follows: Remove all occurrences of “toki” that were in the phrase “toki pona” from occurrences of the word “toki.”
Because the search is now interpretable contextually, we get an unexpected and powerful new capability: The phrase can stand in for any of its words. For example, we could search for pona - toki pona, and the result makes sense: Remove all occurrences of “pona” that appear in “toki pona” from occurrences of “pona.”
Even better, you can replicate the previous method by subtracting a phrase once for each word in it. If we did want to remove every word in ‘toki pona’ from ‘toki’, as in the second method, we can search toki - toki pona - toki pona to get that result.
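Here is a small worked comparison of the second and third methods for the search toki - toki pona. The counts are invented purely for illustration, and the function names are hypothetical; this is a sketch of the arithmetic, not ilo Muni’s code.

```python
TOTAL_WORDS = 10_000
# Invented occurrence counts for one time period.
counts = {"toki": 100, "pona": 80, "toki pona": 30}


def percent(occurrences):
    return occurrences / TOTAL_WORDS * 100


def weighted_by_length(phrase):
    """Second method: a phrase counts for every word in it."""
    return counts[phrase] * len(phrase.split())


def counted_once(phrase):
    """Third method: a phrase counts once per appearance."""
    return counts[phrase]


# toki - toki pona under each method
second = percent(weighted_by_length("toki") - weighted_by_length("toki pona"))
third = percent(counted_once("toki") - counted_once("toki pona"))

# toki - toki pona - toki pona under the third method
replicated = percent(counted_once("toki") - 2 * counted_once("toki pona"))

print(second, third, replicated)
```

Under the third method, 70 of the 100 uses of “toki” remain, which matches the number of times it appeared outside “toki pona”. Subtracting the phrase twice reproduces the second method’s lower result.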
Last note: If you’re a math academic reading this, I am terribly, terribly sorry for not using real notation to demonstrate the above. I assure you, these plain language explanations are much less embarrassing than any attempt I would make to use set notation.
The relative scale is the most generally useful scale, as it implicitly tells you the relationship of your search term to all other terms by showing a percentage instead of a raw number. For example, you can compare the grammatical particles or the colors.
It can be difficult to compare relative graphs for words which are in different magnitudes, such as kijetesantakalu, soweli. I recommend the Relative Minmax scale for cases like these, which can help to identify words which trend in the same way no matter their magnitude.
Cumulative
The Cumulative scale shows you how many times a given word has been said up to a point in time, increasing to the total number of times the word has been said by the present. This is handy for observing the point where a word or phrase becomes more spoken than another, or differently examining periodic phrases. It can also help to clarify the fact that, while one word has become more popular than another in the recent past, it may not have done so all-time.
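A cumulative graph is a running total of the monthly counts. The sketch below uses `itertools.accumulate` from Python’s standard library; the helper `first_overtake` is a hypothetical name, and the counts are invented.

```python
from itertools import accumulate


def cumulative(monthly_counts):
    """Running total of monthly counts."""
    return list(accumulate(monthly_counts))


def first_overtake(a, b):
    """Index of the first month where a's running total exceeds b's, or None."""
    for i, (total_a, total_b) in enumerate(zip(cumulative(a), cumulative(b))):
        if total_a > total_b:
            return i
    return None


older_word = [5, 0, 0]
newer_word = [1, 1, 10]
print(cumulative(newer_word))
print(first_overtake(newer_word, older_word))
```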
Relative Log
The Relative Log scale is the natural logarithm of the relative scale, and it is useful for comparing trends in words of different magnitudes while preserving their popularity order. For example, kijetesantakalu, soweli are now much closer in scale while not errantly implying that kijetesantakalu exceeded soweli in popularity as Relative Minmax would.
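A minimal sketch of the log transform, with invented percentages. The function name `relative_log` is hypothetical, and leaving a gap for months with zero occurrences is an assumption, since the logarithm of zero is undefined.

```python
import math


def relative_log(percentages):
    """Natural logarithm of each relative value."""
    # Assumption: a month with no occurrences becomes a gap (None).
    return [math.log(p) if p > 0 else None for p in percentages]


# Invented values: one common word and one rare word.
soweli = [1.0, 2.0, 1.5]
kijetesantakalu = [0.01, 0.04, 0.02]

print(relative_log(soweli))
print(relative_log(kijetesantakalu))
```

Because the logarithm is an increasing function, the more common word stays above the rarer one at every point, even though the two curves end up much closer together.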
Relative Minmax
The Relative Minmax scale normalizes all the points in the relative graph to be between 0 and 1 while preserving the original curve of the graph. This is excellent for demonstrating that words like ona, li trend similarly with use while being in different magnitudes.
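Minmax normalization can be sketched in a few lines. The function name `minmax` is hypothetical, the values are invented, and mapping a completely flat series to 0 is an assumption made only to avoid dividing by zero.

```python
def minmax(values):
    """Rescale values to the range 0 to 1, keeping the shape of the curve."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a flat series has no shape to preserve, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Two invented series with the same trend at different magnitudes.
print(minmax([2, 4, 6]))
print(minmax([20, 40, 60]))
```

Both series produce the same normalized curve, which is why this scale hides which word is more popular while making shared trends easy to see.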
Absolute Minmax
The Absolute Minmax scale normalizes all the points in the absolute graph to be between 0 and 1 while preserving the original curve of the graph. This makes it helpful for demonstrating that toki, meli trend similarly with community activity while being in different magnitudes. This even applies when comparing words and phrases.
Weird Scales
Sorry- these aren’t filled out yet, because I ran out of time while preparing for suno pi toki pona. I’ll come back for them! Rest assured that I do not have anything useful to say about them yet.
Potential Bias
I discussed ways that smoothing and dates can create misleading graphs, but these are not the only forms of bias possible. Most of the remaining bias is in where and how the data is collected.
Bots
Every platform has bots which send messages, but not every platform is gracious enough to inform you that a given message is from a bot. Fortunately, almost no bots are sending messages in toki pona, but there are a handful worth being aware of because they artificially increase the use of any word or phrase they use if counted.
Discord
Discord does an excellent job of informing you that a given user is a bot, but things get mixed up when it comes to webhooks because of PluralKit. Normally, webhooks are a kind of bot message which is sent automatically but is not attached to any user. They tend to be notifications for things happening on other platforms, like commits on Github or posts on Reddit. However, PluralKit messages are webhook messages from users. Right now, I can’t distinguish between PluralKit messages and any other type of webhook, so I’m forced to count both or neither- and I chose to count both. This means there is some uncertain amount of bot data in the Discord data. Fortunately, only 4.6% of the data is from webhooks, so the impact of this cannot be too large.
In the future, I plan to grab all the webhook messages I have and ask the PluralKit API whether those messages are PluralKit messages. Then I would be able to map those messages back to the host account which originally sent the message, and I could then count only user messages.
Telegram
On Telegram, I have no way to know if a given user is a bot. Telegram’s JSON export format does not include that information. I do have one hard-coded exception, the IRC forwarding bot, because I needed to cut the names out of its messages to represent them as intended. Otherwise, all Telegram bots are invisible to me. That said, all of the ones I’ve seen speak English, so they shouldn’t have any counted sentences.
Reddit
Reddit does not tell you if a given user is a bot as far as I’m aware. Can’t fix it, but like Telegram, none of the bots on Reddit seem to speak toki pona- so no harm done!
Herbevitistoj
Herbevitistoj is “Many people who professionally avoid grass.” It’s a fairly recent and silly way that Esperanto speakers refer to the “terminally online,” or termed more kindly, people who are on the internet a lot. Most toki pona is spoken on the internet in the first place, but there is still a subset of the community which makes up an outsized portion of messages written in toki pona because they are much more active in online spaces. There isn’t anything to be done about this- it’s just a fact to be aware of.
That said, I am personally curious to see what the data would look like if the most active 20% of users were removed from it- maybe something to do in the future.
Platform Notes
Discord
Right now, the data for ilo Muni is collected from Discord, Telegram, and Reddit. Of these, Discord is about 80% of all of the data, and ma pona pi toki pona in particular is about 80% of the data from Discord. You can see the impact of that one server from sections of the graph like this, where a small number of archived channels caused a nearly 50% decline in use of toki pona.
In a sense, Discord is “over-represented” in the data, because it is such a large portion of the data in the first place. For the time being, I have chosen to weight all messages equally, but I would like to produce alternate databases and analyses in the future.
Telegram
As far as I’m aware, I found and successfully exported the entire history of every public toki pona channel on Telegram. No notes.
Reddit
Unfortunately, the pricing for Reddit’s API was changed drastically in summer 2023. This means it is now either difficult or expensive to collect data from the platform, and the user API only allows scrolling back as much as 1000 posts. Fortunately, /u/raiderbdev has done the archival work and /u/Watchful1 has sorted it, so there is Reddit data available, but that data only goes to the end of 2023. As such, there is no Reddit data during 2024. Additionally, the linked archive data only covers the top 40,000 subreddits, which means it only covers /r/tokipona and not any of the other toki pona subreddits such as /r/mi_lon or /r/tokiponataso.
Identifying toki pona
This data would not exist without first being able to detect whether a message is “in toki pona”. I wrote a library to do this, but it has its own collection of complexities which can impact how you interpret the data.
I chose to use the dictionary from lipu Linku, including its sandbox, to identify definite “toki pona words.” These weren’t the only words I counted, because that would miss anything that wasn’t already in my dictionary. But it still isn’t perfect.
Dictionary
If a word in my dictionary matches a word in another language, I would errantly count that word while scoring the sentence. My scoring algorithm has no concept of a penalty currently, so I couldn’t identify that the words around a given one were specifically of some language other than toki pona- I’d just know they didn’t match any of my filters.
This is especially troublesome if a message is a single word in my dictionary, or otherwise very short. “je” is borderline non-existent in toki pona, but it’s in the dictionary, and it’s a first person pronoun in French!
Short words
While writing my sentence scoring algorithm, I noticed that there were tons of 1-2 letter words in other languages which would errantly match my scoring filter, and thus errantly raise the score of messages that were not in toki pona.
To fix this, I changed my syllable checking and alphabetic match checking filters to only score if a given word has at least three letters.
For the most part, this is a good assumption: two letter words are rare enough that I almost certainly have them all in my dictionary already. Doubly so for one letter words, since there are only six possible (a, e, i, o, u, n).
But this means that, if a two letter word were coined, and it were not added to my dictionary, I would score it zero by default- it could still be discovered if it were next to many other toki pona words, but it would drag down the score of the sentence it’s in.
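The effect of the three-letter rule can be shown with a toy scorer. Everything here is an assumption for illustration: the tiny `DICTIONARY` stands in for lipu Linku, the `SYLLABIC` pattern is a simplified guess at a syllable check, and the scoring (fraction of words that pass) is not the real library’s algorithm.

```python
import re

# Stand-in for the real dictionary; deliberately tiny.
DICTIONARY = {"toki", "pona", "mi", "li", "e", "a", "n"}

# Simplified assumption: words are built from (consonant)vowel(n) syllables.
SYLLABIC = re.compile(r"(?:[jklmnpstw]?[aeiou]n?)+")


def score_word(word):
    """1 if the word looks like toki pona, otherwise 0."""
    if word in DICTIONARY:
        return 1
    # Non-dictionary words only pass if they have at least three letters.
    if len(word) >= 3 and SYLLABIC.fullmatch(word):
        return 1
    return 0


def score_sentence(words):
    """Fraction of words in the sentence that passed a filter."""
    return sum(score_word(word) for word in words) / len(words)


print(score_word("misikeke"))  # not in the toy dictionary, but long enough
print(score_word("to"))        # two letters and not in the dictionary
print(score_sentence(["mi", "toki", "to"]))
```

A hypothetical new two-letter word such as “to” scores zero on its own and lowers the score of the sentence around it, which is the tradeoff described above.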
Frequently Asked Questions
Why does it take so long to get a graph?
Searching for one or a few specific terms should only take a few seconds, but to be real: You’re reading a big SQLite blob over the network, making requests to it one by one. I can only speed this up so much.
If I had a database hosting solution, nearly all of the queries would be as instant as the network itself. If you have any suggestions for one, let me know!
Why is the graph empty?
This is an error handling issue. You made a search with a malformed term and I didn’t catch and throw it out. Oops. I already know why this is happening, but I’m going to fix this later. Feel free to open an issue though!
Why is my subtraction negative?
That’s allowed! If you do tawa pona - kama pona, you’ll get a graph which is mostly negative. This means the phrase “kama pona” is more common than the phrase “tawa pona”, probably because the community is very welcoming!
It’s probably possible to get floating point silliness when subtracting, but I haven’t seen that happen personally- please reach out if you spot it!
Note: the y-axis scale may sometimes display in scientific notation instead of integers or rationals. This is not an error! That’s just ChartJS!
Why is the data so noisy before 2020?
In short, there is much less data to examine from before 2020. This is why I set the default start date to August 2016, rather than the actual start of my data in August 2010.
So the next question is, why 2020? Although I probably don’t need to answer that, I’ll go ahead and do so:
When everyone was trapped indoors for some two years during the COVID-19 pandemic, toki pona saw a huge spike in popularity. You can see the climb in activity in every word when the scale is set to “absolute”. This also affects the relative graph though- before 2020, each word written is a much larger portion of all the words for that time period! To help, you can add smoothing to relative mode, which will average out
Why is there a huge spike on [date] for [word]?
This data isn’t from professional sources, unlike Google Ngrams which is sourced entirely from published books. In professional sources, you wouldn’t expect an editor to let a paragraph like woo yeah! woo yeah! woo yeah! woo yeah! woo yeah! woo yeah! remain in the final product. But in Discord and any other social media platform, there is no editorial oversight- silly goofy abounds. This affects mu, wan, tu, luka, mute, ale, and probably others, which you can see here. If you spot others, let me know!
Relatedly, there was a time during development where “mu” had a spike to over 40,000 uses because of a day in which a handful of messages were nothing but “mu” to the text limit of Discord. Because of this, I added a nonsense filter to skip sentences before they get counted, which works like so: If a sentence is more than 10x the average sentence length (4.13557) and more than 50% a single word, it gets thrown out. Similarly, if a sentence is more than 100x the average sentence length, it gets thrown out immediately. This filter isn’t perfect though, because somebody can say “mu. mu. mu.” to similar effect, and each of those will be counted as individual sentences. Working on it!
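The two rules of the nonsense filter can be sketched as follows. The thresholds and the average come from the description above; the function name `is_nonsense`, and the assumption that sentence length is measured in words, are mine.

```python
from collections import Counter

AVERAGE_SENTENCE_LENGTH = 4.13557  # assumed to be measured in words


def is_nonsense(words):
    """Return True if a sentence should be skipped before counting."""
    length = len(words)
    # Rule 2: extremely long sentences are thrown out immediately.
    if length > 100 * AVERAGE_SENTENCE_LENGTH:
        return True
    # Rule 1: long sentences dominated by a single word are thrown out.
    if length > 10 * AVERAGE_SENTENCE_LENGTH:
        most_common_count = Counter(words).most_common(1)[0][1]
        return most_common_count / length > 0.5
    return False


print(is_nonsense(["mu"] * 50))         # long and entirely one word
print(is_nonsense(["mu"] * 3))          # short, so it is kept
print(is_nonsense(["mu", "pona"] * 25)) # long, but no word exceeds 50%
```

This also shows the gap mentioned above: a message like “mu. mu. mu.” arrives as many short sentences, and each one passes the filter.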
Ultimately, I would like this data to reflect how toki pona is actually used- and, granted, somebody sending hundreds of “mu” is using toki pona. But I think most people would agree that those messages are not reflective of how toki pona is used, and they otherwise make it difficult to examine the rest of the data.