Help! for ilo Muni

This page is about how to use and understand ilo Muni, which you can work through like a tutorial or a manual. If you want to know how or why ilo Muni exists, or want to talk to me, see the about page!

There’s also a quick reference under the help button on the main page.

Table of Contents

Words

Search for a toki pona word such as pona. If it appeared at least 40 times across all the places and times I checked, you’ll get a graph showing how that word has been used over time! By default, you’ll see what percentage of all words the searched word made up, with monthly datapoints.

You can also search for proper names like Sonko, Inli, or Siko. There are lots of very rare words in the database too. You might be surprised by what you find, so try lots of things!

Examples

Phrases

You can search for up to 6 words in one phrase, such as mi kama sona e toki pona. Note that the more words a phrase has, the less often it is likely to appear, so don’t be surprised if you don’t get a result. Try searching for shorter phrases, like kama pona or anu seme.

Examples

Multiple Searches

You can graph multiple phrases at once by separating them with commas ,, even mixing words and phrases: toki, pona, toki pona. Often, different words or phrases will have very different amounts of use, which can make it hard to read anything but the top one or two graphs. You can address that with the alternate scales covered later in this tutorial.

Examples

Wildcard

You can search for multiple phrases of the same form with a wildcard by replacing one word other than the first in a phrase with *. This will search for the ten most popular phrases that match your search and graph them.

Examples

Adding phrases

You can add two or more phrases together by putting a + between them. This is helpful for combining synonyms, like ale + ali. You can also use it to compare multiple related words to another one, like pan + kili, moku.

It often helps to graph the sum alongside some or all of its individual phrases so you can see how much the graph has shifted, such as in pan + kili, moku, pan, kili.

Examples

Subtracting phrases

You can subtract one or more phrases from another by putting a - between them. This is helpful when you’d like to omit a specific use of a word, such as in toki - toki pona. This gives you all the uses of the word “toki” which are not in the phrase “toki pona”.

Examples

Minimum Sentence Length

You can set a minimum sentence length for one term by adding an underscore _ and a number from 1 to 6 to the end of that term. For example, toki_1, toki_6 will show you the percentage of times toki appeared in any sentence, versus the smaller percentage of times it appeared specifically in sentences with at least 6 words. You can do this with phrases too: ona li, ona li_6.

You can use this with subtraction to isolate a word or phrase: The search toki - toki_2 will show you every time “toki” appeared, minus the times it appeared in sentences with 2 or more words- which means you have all the times “toki” was the only word in the sentence. Or, put another way, all the times toki meant “hello”! (except for the times it was an answer to a question, of course!)

Read ahead to the options section on Minimum Sentence Length for more details!

Examples

Options

Note: Scale has its own section, because it’s much more complicated than the other options.

Smoothing

By default, smoothing is set to 2. The number is how many neighbors on each side of a given data point get averaged into it. For example, if you set 5 smoothing with the Window Avg smoother, a given point will be set to the average of the 5 points before it, the 5 points after it, and itself.
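
If you like thinking in code, here’s a rough Python sketch of that kind of window average- this is just an illustration, not ilo Muni’s actual code, and how the edges of the graph are handled here is my own guess:

```python
def window_average(points, smoothing):
    """Illustrative window-average smoother: each point becomes the mean of
    itself and up to `smoothing` neighbors on each side (edges use fewer)."""
    smoothed = []
    for i in range(len(points)):
        lo = max(0, i - smoothing)
        hi = min(len(points), i + smoothing + 1)
        window = points[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

# With smoothing = 2, a single spike gets spread across its neighbors.
print(window_average([0, 0, 10, 0, 0], smoothing=2))
# [3.33..., 2.5, 2.0, 2.5, 3.33...]
```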

Smoothing is helpful for making noisy graphs more readable while preserving the trend line of the original graph. Compare the graphs of wawa, nasa, suwi, sewi, suli with 0 smoothing and 5 smoothing.

Note that smoothing can produce misleading graphs with respect to the time axis, such as smearing periodic phrases over too much time or implying that misikeke was used before November 2019 with specific smoothers. Sometimes, 0 smoothing is better!

Some scales will have smoothing disabled, usually because it wouldn’t make sense to average their values. This applies to the absolute scale, for example, because it is meant to show you the exact number of times a given word or phrase appeared! This also applies to both offered derivatives, because they are completely impervious to localized averaging.

Dates

By default, the date range for the graph is set from August 2016 to August 2024. You can select any start or end you want, but there are some caveats to warn you about:

First, the graph’s last datapoint is July 12th, 2024, and it covers that day through August 7th, 2024. The graph ends there because I collected this data during August 2024, so the incomplete data from August 8th onwards is not present in the database.

Second, the default start date is August 2016 because the data prior to that is extremely sparse. I have left in the option to query for that data, but understand that relative graphs will be noisy, the absolute graphs will be flat, minmax graphs will become nonsense- and for the other graphs, here be dragons.

Historical note

In databases published prior to September 7th, 2024, I counted words in monthly “buckets.” If you study a graph from then, or download one of the older databases to graph yourself, you’ll see each point aligns with a specific month and is labeled as such. This is a straightforward way to graph and read the data, but it has some disadvantages for interpretability.

Some months are shorter than others, so the absolute scale may be misleading for those periods, implying they were less active than neighboring months. Similarly, weeks are not evenly distributed over months, so some months will have more weekends and therefore more active periods than others. Fortunately, this doesn’t change the relative scale, since that is measured as a percentage of words said.

Still, in the future I would like to change the size of the time “buckets” to be 4-weekly, to reduce or eliminate the biases described above.

Minimum Sentence Length

Note: This is hidden by default! Click to show it.

This option is also called words per sentence in its dropdown. By default, All sentences is set, meaning you will see how words or phrases appear in any length of sentence. If you set this option to 3+ words per sentence, you’ll see how words or phrases appear in sentences which have at least 3 words. This can be helpful if you want to study more “substantial” uses of words, i.e. those that appear in longer sentences.

If one of your searches sets the minimum sentence length for a term, pay attention to the legend: If the legend shows the term without an underscore, it means the length you chose was already being searched. This can happen with a search like toki pona_2 normally, or wawa_3 while the minimum sentence length dropdown is set to 3.

This happens to the phrase toki pona with a minimum sentence length of 2 because phrases have a minimum sentence length equal to how many words are in them. That is, “toki pona” can only appear in sentences with at least two words- which I hope makes sense! Because of this, the minimum sentence length is always implicitly set to at least the length of the phrase. You can always set it to be higher, of course.

Note that when graphing on any relative scale, the percentage is derived by dividing the number of occurrences for the search term by the total number of words in the same time period.

If you’re curious why this is done instead of dividing by the number of words in sentences of the appropriate minimum length, here’s a simplified example:

Demonstration

Imagine the following scenario: the community said 10,000 words in total, and 9,000 of those words were in sentences with at least 2 words. The word “toki” appeared 100 times overall, and 90 of those appearances were in sentences with at least 2 words.

With this data, asking “What percentage of words are toki?” gives us 1% (100 / 10,000), which makes sense- and it is the only reasonable way to measure toki on its own.

However, there are two ways to measure toki in sentences with at least 2 words (toki_2): You could measure it as a portion of all words, or as a portion of words from sentences with at least 2 words. These two different choices have different results: toki_2 is 0.9% (90 / 10,000) of all words, but 1% (90 / 9,000) of the words in sentences with at least 2 words.

We can use this information to determine which answer is best for this graphing tool by exploring what happens when you search toki - toki_2.

In the sample data, graphing toki - toki_2 with toki_2 measured as a portion of all words means we get 0.1%: ((100 - 90) / 10,000). That is, toki appears exactly 10 times on its own in this data. Graphing instead with toki_2 measured as a portion of words in sentences with 2 or more words means we get 0%: ((100 / 10,000) - (90 / 9,000)). In other words, we get a misleading outcome because we’re subtracting two incomparable percentages.
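
Here’s the same arithmetic as a small Python sketch, using the made-up numbers from the scenario above:

```python
# Made-up sample data from the scenario above.
total_words = 10_000       # all words in this time period
words_in_2plus = 9_000     # words that are in sentences with at least 2 words
toki_all = 100             # occurrences of "toki" anywhere
toki_in_2plus = 90         # occurrences of "toki" in sentences with at least 2 words

toki_pct = toki_all / total_words                # 1% of all words

# Choice A (what ilo Muni does): measure toki_2 against all words.
toki_2_pct_a = toki_in_2plus / total_words
print(f"{toki_pct - toki_2_pct_a:.2%}")          # 0.10% -> the ten standalone "toki"s

# Choice B: measure toki_2 against only the words in 2+ word sentences.
toki_2_pct_b = toki_in_2plus / words_in_2plus
print(f"{toki_pct - toki_2_pct_b:.2%}")          # 0.00% -> the standalone uses disappear
```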

There is value in knowing what portion of the words in sentences with at least 2 words are some specific word. This graphing tool does not offer that information because doing so would produce misleading graphs for both side-by-side comparison and for adding or subtracting specific results. If you’re interested in that alternate data, download the database!

Smoother

Note: This is hidden by default! Click to show it.

By default, Window Avg is set, and I don’t recommend changing it from the default unless you’re aware of what change you’re making and why.


Scales

Absolute

The Absolute scale shows the exact number of times a given word or phrase was said in each time period. This is useful for observing trends in the activity of the community, but can make it difficult to compare words which are of very different magnitude, or to study the use of toki pona before March 2020.

If you’re interested in comparing two or more absolute graphs, but one word is vastly more popular than the others, check again using the Absolute Minmax scale.

Relative

The Relative scale is the default scale, and it shows you what percentage of all words the searched term makes up in each time period on the graph. This also applies to phrases, which you would interpret as follows: “What portion of all words are any one word from this phrase?”

The justification for doing the math this way is similar to that for minimum sentence length, which you can read about below. Spoiler: I dunk on Google.

Discussion

As with minimum sentence length, there is more than one meaningful question that could be asked about phrases. In this case, there are three:

What percentage of all same-length phrases are this phrase?

This method has the same problem of incomparable percentages as seen in minimum sentence lengths: There isn’t a way to compare “toki” and “toki pona” if their percentages are measured against unrelated totals- and “percentage of all words” is not comparable to “percentage of all bigrams”. Strangely, Google Ngrams does this intentionally according to their info page, in spite of the fact that they allow you to add and subtract n-grams of different length. Direct quote, with bold added by me:

What the y-axis shows is this: of all the bigrams contained in our sample of books written in English and published in the United States, what percentage of them are “nursery school” or “child care”?

What percentage of all words are this phrase?

For graphing a single bi-gram, you can get interesting results from this question. To do this, we would multiply the number of times the phrase occurred by the length of the phrase, then divide by the total number of words. This means that, percentage-wise, each phrase counts for all of its words.

This is a problem, which we can see with a search such as toki - toki pona. This math implies we would remove some occurrences of “pona” from occurrences of “toki,” because the phrase “toki pona” accounts for both of its words. There were never any occurrences of “pona” in the occurrences of “toki,” so we would get a lower result than expected.

What percentage of all words are any word in this phrase?

In this method, the search toki - toki pona means to remove all occurrences of “toki pona” from “toki”. Since the phrase “toki pona” is only counted for how many times it appeared, and not for its number of words, the math works out: we are only removing the occurrences of any one word in the phrase “toki pona” from the word “toki”. This method treats phrases as a form of context, so you could interpret that search as follows: Remove all occurrences of “toki” that were in the phrase “toki pona” from occurrences of the word “toki.”

Because the search is now interpretable contextually, we get an unexpected and powerful new capability: The phrase can stand in for any of its words. For example, we could search for pona - toki pona, and the result makes sense: Remove all occurrences of “pona” that appear in “toki pona” from occurrences of “pona.”

Even better, you can replicate the previous method by subtracting a phrase once for each word in it. If we did want to remove every word in ‘toki pona’ from ‘toki’, as in the second method, we can search toki - toki pona - toki pona to get that result.
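
To make the difference between the second and third methods concrete, here’s a small Python sketch with made-up counts:

```python
# Made-up counts for one time period.
total_words = 100_000
count_toki = 1_000       # occurrences of the word "toki"
count_toki_pona = 400    # occurrences of the phrase "toki pona"

# Second method: the phrase counts for all of its words.
print(f"{(count_toki - 2 * count_toki_pona) / total_words:.2%}")
# 0.20% -> we also removed 400 "pona"s that were never part of the "toki" count

# Third method (what the relative scale does): the phrase counts once.
print(f"{(count_toki - count_toki_pona) / total_words:.2%}")
# 0.60% -> we removed only the "toki"s that appeared inside "toki pona"

# Subtracting the phrase once per word in it reproduces the second method.
print(f"{(count_toki - count_toki_pona - count_toki_pona) / total_words:.2%}")
# 0.20%
```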

Last note: If you’re a math academic reading this, I am terribly, terribly sorry for not using real notation to demonstrate the above. I assure you, these plain language explanations are much less embarrassing than any attempt I would make to use set notation.

The relative scale is the most generally useful scale, as it implicitly tells you the relationship of your search term to all other terms by showing a percentage instead of a raw number. For example, you can compare the grammatical particles or the colors.

It can be difficult to compare relative graphs for words of very different magnitudes, such as kijetesantakalu, soweli. I recommend the Relative Minmax scale for cases like these, which can help identify words that trend in the same way no matter their magnitude.

Cumulative

The Cumulative scale shows you how many times a given word has been said up to each point in time, increasing to the total number of times the word has been said by the present. This is handy for observing the point where one word or phrase becomes more spoken than another, or for examining periodic phrases in a different way. It can also help to clarify that, while one word may have become more popular than another in the recent past, it has not necessarily done so all-time.
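
In code terms, the cumulative scale is just a running total of the per-period counts- here’s a tiny sketch with made-up numbers:

```python
from itertools import accumulate

# Made-up counts of a word per time period; the cumulative scale is their running total.
counts = [5, 20, 10, 40, 25]
print(list(accumulate(counts)))  # [5, 25, 35, 75, 100]
```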

Relative Log

The Relative Log scale is the natural logarithm of the relative scale, and it is useful for comparing trends in words of different magnitudes while preserving their popularity order. For example, kijetesantakalu, soweli are now much closer in scale while not errantly implying that kijetesantakalu exceeded soweli in popularity as Relative Minmax would.
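
As a quick sketch with made-up relative values, taking the natural log pulls the two magnitudes much closer together without changing which one is larger:

```python
import math

# Made-up relative values (fractions of all words) for two terms of very different size.
soweli = [0.004, 0.005, 0.006]
kijetesantakalu = [0.0001, 0.0004, 0.0008]

# The natural log shrinks the gap between them but keeps soweli above kijetesantakalu.
print([round(math.log(v), 2) for v in soweli])           # [-5.52, -5.3, -5.12]
print([round(math.log(v), 2) for v in kijetesantakalu])  # [-9.21, -7.82, -7.13]
```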

Relative Minmax

The Relative Minmax scale normalizes all the points in the relative graph to be between 0 and 1 while preserving the original curve of the graph. This is excellent for demonstrating that words like ona, li trend similarly with use while being in different magnitudes.
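
Here’s a sketch of what min-max normalization does to a series. The same math applies to the Absolute Minmax scale below; how a perfectly flat series would be handled is my own guess:

```python
def minmax(points):
    """Rescale a series to the range [0, 1] while preserving its shape."""
    lo, hi = min(points), max(points)
    if hi == lo:
        return [0.0 for _ in points]  # a perfectly flat series; avoid dividing by zero
    return [(p - lo) / (hi - lo) for p in points]

print(minmax([2, 4, 10, 6]))  # [0.0, 0.25, 1.0, 0.5]
```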

Absolute Minmax

The Absolute Minmax scale normalizes all the points in the absolute graph to be between 0 and 1 while preserving the original curve of the graph. This makes it helpful for demonstrating that toki, meli trend similarly with community activity while being in different magnitudes. This even applies when comparing words and phrases.


Weird Scales

Sorry- these aren’t filled out yet, because I ran out of time while preparing for suno pi toki pona. I’ll come back for them! Rest assured that I do not have anything useful to say about them yet.

Potential Bias

I discussed how smoothing can create misleading graphs, but that is not the only possible form of bias. Most of the remaining bias is in where and how the data is collected.

Bots

Every platform has bots which send messages, but not every platform is gracious enough to inform you that a given message is from a bot. Fortunately, almost no bots send messages in toki pona, but there are a handful worth being aware of, because if counted they artificially inflate the use of any word or phrase they send.

Discord

Discord does an excellent job of informing you that a given user is a bot, but things get mixed up when it comes to webhooks because of PluralKit. Normally, webhooks are a kind of bot message which is sent automatically but is not attached to any user. They tend to be notifications for things happening on other platforms, like commits on Github or posts on Reddit. However, PluralKit messages are webhook messages from users. Right now, I can’t distinguish between PluralKit messages and any other type of webhook, so I’m forced to count both or neither- and I chose to count both. This means there is some uncertain amount of bot data in the Discord data. Fortunately, only 4.6% of the data is from webhooks, so the impact of this cannot be too large.

In the future, I plan to grab all the webhook messages I have and ask the PluralKit API whether those messages are PluralKit messages. Then I would be able to map those messages back to the host account which originally sent the message, and I could then count only user messages.

Telegram

On Telegram, I have no way to know if a given user is a bot. Telegram’s JSON export format does not include that information. I do have one hard-coded exception, the IRC forwarding bot, because I needed to cut the names out of its messages to represent them as intended. Otherwise, all Telegram bots are invisible to me. That said, all of the ones I’ve seen speak English, so they shouldn’t have any counted sentences.

Reddit

Reddit does not tell you if a given user is a bot, as far as I’m aware. I can’t fix that, but like Telegram, none of the bots on Reddit seem to speak toki pona- so no harm done!

Herbevitistoj

Herbevitistoj is “Many people who professionally avoid grass.” It’s a fairly recent and silly way that Esperanto speakers refer to the “terminally online,” or termed more kindly, people who are on the internet a lot. Most toki pona is spoken on the internet in the first place, but there is still a subset of the community which makes up an outsized portion of messages written in toki pona because they are much more active in online spaces. There isn’t anything to be done about this- it’s just a fact to be aware of.

That said, I am personally curious to see what the data would look like if the most active 20% of users were removed from it- maybe something to do in the future.

Platform Notes

No notes for Telegram, Reddit, YouTube, or the Toki Pona forums; all of them are represented in their entirety, or as completely as can reasonably be obtained, up to the final date represented in ilo Muni.

Discord

Right now, the data for ilo Muni is collected from Discord, Telegram, and Reddit. Of these, Discord is about 80% of all of the data, and ma pona pi toki pona in particular is about 80% of the data from Discord. You can see the impact of that one server from sections of the graph like this, where a small number of archived channels caused a nearly 50% decline in use of toki pona.

In a sense, Discord is “over-represented” in the data, because it is such a large portion of the data in the first place. For the time being, I have chosen to weight all messages equally, but I would like to produce alternate databases and analyses in the future.

Identifying toki pona

This data would not exist without first being able to detect whether a message is “in toki pona”. I wrote a library to do this, but it has its own collection of complexities which can impact how you interpret the data.

I chose to use the dictionary from lipu Linku, including its sandbox, in order to identify definite “toki pona words.” These weren’t the only words I counted, because that would miss anything that wasn’t already in my dictionary. But it still isn’t perfect.

Dictionary

If a word in my dictionary matches a word in another language, that word gets errantly counted while scoring the sentence. My scoring algorithm currently has no concept of a penalty, so I can’t tell that the words around a given one are specifically in some language other than toki pona- I’d just know they didn’t match any of my filters.

This is especially troublesome if a message is a single word in my dictionary, or otherwise very short. “je” is borderline non-existent in toki pona, but it’s in the dictionary, and it’s a first person pronoun in French!

Short words

While writing my sentence scoring algorithm, I noticed that there were tons of 1-2 letter words in other languages which would errantly match my scoring filter, and thus errantly raise the score of messages that were not in toki pona.

To fix this, I changed my syllable checking and alphabetic match checking filters to only score if a given word has at least three letters.

For the most part, this is a good assumption: two letter words are rare enough that I almost certainly have them all in my dictionary already. Doubly so for one letter words, since there are only six possible (a, e, i, o, u, n).

But this means that, if a two letter word were coined, and it were not added to my dictionary, I would score it zero by default- it could still be discovered if it were next to many other toki pona words, but it would drag down the score of the sentence it’s in.
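
Here’s a toy sketch of that length rule- the tiny dictionary and the alphabet check below are stand-ins for illustration, not the real library’s filters:

```python
# A tiny stand-in dictionary and a crude alphabet check, purely for illustration.
KNOWN_WORDS = {"a", "e", "mi", "li", "toki", "pona"}

def passes_filters(word: str) -> bool:
    """Crude stand-in for the syllable/alphabet checks."""
    return all(c in "aeijklmnopstuw" for c in word.lower())

def word_scores(word: str) -> bool:
    if word.lower() in KNOWN_WORDS:
        return True                 # dictionary words always count
    if len(word) < 3:
        return False                # unknown 1-2 letter words score zero by default
    return passes_filters(word)     # longer unknown words can still score

print(word_scores("wi"))    # False: a hypothetical two-letter coinage not in the dictionary
print(word_scores("toki"))  # True: in the dictionary
print(word_scores("kasi"))  # True: long enough and passes the alphabet check
```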


Frequently Asked Questions

Why does [query] take so long?

While searching for one or a few terms should only take about half a second, searching with a wildcard or for many terms will take much longer- you can estimate this by multiplying the half second by the number of queries.

That said, even large queries will become faster as you make more queries in the same session. This is because you’re downloading and caching more of the index needed to fulfill queries!

But this is limited by the fact that each part of each query must be fetched consecutively, including for different queries. There is only one worker fetching data, because multiple workers would be unable to share their cache.

If I had a database hosting solution, nearly all of the queries would be as instant as the network itself. If you have any suggestions for one, let me know!

Why is my subtraction negative?

That’s allowed! If you do tawa pona - kama pona, you’ll get a graph which is mostly negative. This means the phrase “kama pona” is more common than the phrase “tawa pona”, probably because the community is very welcoming!

It’s probably possible to get floating point silliness when subtracting, but I haven’t seen that happen personally- please reach out if you spot it!

Why is the data so noisy before 2020?

In short, there is much less data to examine from before 2020. This is why I set the default start date to August 2016, rather than the actual start of my data in August 2010.

So the next question is, why 2020? Although I probably don’t need to answer that, I’ll go ahead and do so:

When everyone was trapped indoors for some two years during the COVID-19 pandemic, toki pona saw a huge spike in popularity. You can see the climb in activity in every word when the scale is set to “absolute”. This also affects the relative graph though- before 2020, each word written is a much larger portion of all the words for that time period! To help, you can add smoothing to relative mode, which will average out much of that noise.

Why is there a huge spike on [date] for [word]?

This data isn’t from professional sources, unlike Google Ngrams which is sourced entirely from published books. In professional sources, you wouldn’t expect an editor to let a paragraph like woo yeah! woo yeah! woo yeah! woo yeah! woo yeah! woo yeah! remain in the final product. But in Discord and any other social media platform, there is no editorial oversight- silly goofy abounds. This affects mu, wan, tu, luka, mute, ale, and probably others, which you can see here. If you spot others, let me know!

Relatedly, there was a time during development when “mu” had a spike to over 40,000 uses because of a day in which a handful of messages were nothing but “mu” to the text limit of Discord. Because of this, I added a nonsense filter to skip sentences before they get counted, which works like so: If a sentence is more than 10x the average sentence length (4.13557) and more than 50% a single word, it gets thrown out. Similarly, if a sentence is more than 100x the average sentence length, it gets thrown out immediately. This filter isn’t perfect though, because somebody can say “mu. mu. mu.” to similar effect, and each of those will be counted as an individual sentence. Working on it!
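
For the curious, here’s roughly what that nonsense filter looks like as a Python sketch- a simplified illustration, not the exact implementation:

```python
AVERAGE_SENTENCE_LENGTH = 4.13557  # the average quoted above

def is_nonsense(words: list[str]) -> bool:
    """Sketch of the spam rule: very long sentences dominated by one word get skipped."""
    n = len(words)
    if n > 100 * AVERAGE_SENTENCE_LENGTH:
        return True  # absurdly long sentences are thrown out immediately
    if n > 10 * AVERAGE_SENTENCE_LENGTH:
        most_common = max(words.count(w) for w in set(words))
        return most_common > n / 2  # long and more than 50% a single word
    return False

print(is_nonsense(["mu"] * 60))                  # True: 60 words, all of them "mu"
print(is_nonsense("toki pona li pona".split()))  # False: a normal sentence
```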

Ultimately, I would like this data to reflect how toki pona is actually used- and, granted, somebody sending hundreds of “mu” is using toki pona. But I think most people would agree that those messages are not reflective of how toki pona is used, and they otherwise make it difficult to examine the rest of the data.