About ilo Muni

This page explains what ilo Muni is, how and why it exists, and gives thanks to everyone who helped make it possible. If you want to know how to use ilo Muni, check the help page!

Feel free to skip around. If you don’t care about a section, that’s totally okay!

My question isn’t here!

Feel free to email me! You can also reach out on other platforms, but email is easiest for me to see and respond to (because it is the quietest way people try to talk to me).

Why did you make ilo Muni?

Back in January 2024, several members of the toki pona community came together with the same idea all at once: We should write a proposal to get sitelen pona included in Unicode! From there, it turned out that jan Lepeka had experience writing proposals for Unicode and had already written a massive portion of a sitelen pona proposal. Many others signed on to help, so we all got to work right away.

This turned out to be challenging for many reasons beyond documenting how toki pona and sitelen pona are used.

First, many of the members of the workgroup did not agree on which words should be encoded or why. We examined the Linku usage data to get a reasonable answer for which words to encode. That took a while and had its own complexities, but it did work out at the time.

Unfortunately, there was no comparable data for sitelen pona glyphs. While many were obvious because most words only have one glyph, over a dozen words have multiple glyphs with different levels of use, especially when the words themselves are rarely used. We tried running an ad-hoc survey to study glyph usage, but this simultaneously revealed how limited our perspective was and how difficult it would be to answer our original question.

While examining the survey data, we found we had failed to offer at least one glyph as an option, resulting in a large number of write-ins for a version of “olin” where the hearts overlap rather than stack. Then we observed that the results for some glyphs were far higher than expected or anecdotally observed, particularly a variant of “linluwi” where the three circles of “kulupu” are connected with lines.

This, coupled with a later survey about whether certain glyphs are “distinct”, led us down a brand new rabbit hole: A large number of survey respondents seemed to answer how they would like to use toki pona, or how they think they will use it in the future, rather than how they do so now. Several members of the workgroup, including myself, could not reconcile the reported usage from the surveys with our observations.

By March, we hadn’t come to any agreement about what glyphs to use for some words, and the prior agreement about what words to encode in the first place had fallen apart because of the observed issues with the two glyph surveys. At that point, we chose to pivot: What if we directly studied usage?

We submitted a preliminary proposal with the only set of words we could all agree must be in the proposal, that being the nimi pu plus tonsi. It has not been added to the Unicode document register as of writing, but Unicode has acknowledged it!

After this, the work left to be done was much more open-ended. I started improving upon my Discord bot, ilo pi toki pona taso. At the time, it had a simple way to detect whether or not you were speaking toki pona, so that it could remind you to speak toki pona if you weren’t. But when I say simple, I do mean simple: it could be evaded by capitalizing the first word of every sentence, or carefully dodged by using only the 14 letters in toki pona’s alphabet, or ignored entirely by quoting your text. This wouldn’t do, since my goal was to fetch all the data I could from as many communities as I could and examine only the messages that are “in toki pona.”
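To make those loopholes concrete, here is a minimal sketch of a naive checker with the same kinds of weaknesses. It is not the code ilo pi toki pona taso actually used: the function name `looks_like_toki_pona`, the 80% threshold, and the three rules are assumptions for illustration. In particular, it approximates the capitalization loophole by letting any capitalized word pass as a proper name.

```python
import re

# The 14 letters of toki pona's alphabet.
ALPHABET = set("aeijklmnopstuw")


def looks_like_toki_pona(message: str, threshold: float = 0.8) -> bool:
    """Naive check using hypothetical rules that mirror the loopholes above."""
    # Loophole: quoted text is thrown away before anything is scored.
    unquoted = re.sub(r'"[^"]*"', " ", message)
    words = re.findall(r"[A-Za-z]+", unquoted)
    if not words:
        return True  # nothing left to object to

    passed = 0
    for word in words:
        if word[0].isupper():
            passed += 1  # Loophole: capitalized words pass as proper names.
        elif set(word) <= ALPHABET:
            passed += 1  # Loophole: any string over the 14 letters passes.
    return passed / len(words) >= threshold


print(looks_like_toki_pona("mi moku e kili"))          # real toki pona
print(looks_like_toki_pona("i am eating fruit"))       # plain English
print(looks_like_toki_pona("i like a simple salami"))  # English, 14 letters only
```

The third call is English, yet every letter in it belongs to the alphabet, so a letter-based check cannot tell it apart from toki pona. Closing gaps like this needs a parser that knows actual words and syllable rules.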

From there, I created a parsing library called sona toki, a script called sona mute for counting all the words in toki pona sentences and stuffing them into a database, and this tool, ilo Muni. The work you’re reading now has been five months in the making!
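As a rough illustration of the counting step, the sketch below tallies words per month and stores the totals in SQLite. It is not sona mute's actual code or schema: the `frequency` table, its columns, the whitespace tokenizer, and the sample messages are all made up for this example.

```python
import sqlite3
from collections import Counter
from datetime import date


def count_by_month(messages):
    """Tally (month, word) pairs from an iterable of (date, text) messages."""
    counts = Counter()
    for sent_on, text in messages:
        month = sent_on.strftime("%Y-%m")
        # Splitting on whitespace stands in for a real tokenizer.
        for word in text.lower().split():
            counts[(month, word)] += 1
    return counts


def store(counts, db):
    """Write the tallies into a hypothetical `frequency` table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS frequency ("
        "month TEXT, word TEXT, hits INTEGER, PRIMARY KEY (month, word))"
    )
    db.executemany(
        "INSERT OR REPLACE INTO frequency VALUES (?, ?, ?)",
        [(month, word, hits) for (month, word), hits in counts.items()],
    )
    db.commit()


# Made-up sample messages; every message carries a date.
messages = [
    (date(2024, 1, 5), "toki a"),
    (date(2024, 1, 9), "mi toki"),
    (date(2024, 2, 1), "toki pona li pona"),
]

db = sqlite3.connect(":memory:")
store(count_by_month(messages), db)
for row in db.execute("SELECT month, word, hits FROM frequency ORDER BY month, word"):
    print(row)
```

Keying the counts by month is what makes a date on every message necessary: without one, a word could be counted but never placed on a timeline.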

Where is the data from?

Anywhere toki pona is written in text, so long as there is a date associated with every message and the community is open to the public. Right now, I support six platforms:

Discord

Discord makes up the majority of all written toki pona. For that matter, ma pona pi toki pona is nearly the majority of all written toki pona on its own. As such, this data is the most important to have a look at- but Discord is among the least export-friendly platforms on the internet. Discord does not offer any native functionality to export messages. Fortunately, this excellent project enables you to fetch all messages you have access to in any server you’re in, including in threads, and with tons of metadata per message!

The main challenge was actually in finding all of the communities to fetch- I could have settled for just the large ones, since they’d represent most of the data and be easy to find, but I chose to hunt down as many as I could. I ended up finding over 120 servers, and I’m certain there are more!

Telegram

Telegram offers an “export chat” function directly in its desktop UI, and this dumps every message you can see from the start of the chat’s existence to the present. Perfect! I spent some time searching to find all the communities I could, then asking around in other communities. Once I was done, I listed them here too.

After that, jan Pensa (@spencjo) helped me out by exporting two particular chats which were special cases:

The format for Telegram messages is a bit odd in their official exports, and I can’t tell which users are bots from the exports alone. Other than that, Telegram was refreshingly easy to add to this list.

Reddit

Reddit moved its API behind a paywall long before I began this project, so it is no longer reasonable to “officially” scrape Reddit. They also limited scrollback on all endpoints in the user API to only 1000 items, meaning unofficial scraping would have to be done live in order to be sure you fetched everything. I obviously wasn’t doing my own live scraping of Reddit, but thanks to the incredible work of Pushshift, /u/raiderbdev, and /u/Watchful1, I was able to fetch /r/tokipona and a dozen smaller subreddits including /r/mi_lon, /r/tokiponataso, /r/tokiponaunpa, and /r/liputenpo.

Previously, I was limited to just /r/tokipona through the end of 2023, but /u/Watchful1 reached out and created a dump for all these communities through the end of July 2024. Enormous thanks!

YouTube

Thanks to yt-dlp, it is shockingly easy to make archives of your favorite YouTube videos. To my surprise, it also has the capability to download comments without downloading the associated videos other than their metadata, so I fetched everything I could find related to Toki Pona! However, there is no “toki pona community” on YouTube, because YouTube doesn’t have community structures. There are channels and videos- and for most purposes, that’s it.

Fortunately, there are a few playlists on YouTube such as this one and this one which collect huge lists of known toki pona videos. With this, plus a few obvious search terms like “toki pona” and “kijetesantakalu,” I collected an initial list of videos and their associated authors. Since the list of authors represented almost exclusively those who had, at some point, uploaded at least one Toki Pona video, I added all of them to a separate list- and then I downloaded every video from each of these channels. This functionally guarantees an extremely high degree of coverage.
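The comment-only download mentioned above can be sketched as follows. This only builds, and optionally runs, a yt-dlp command line. The helper names, the `archive` output directory, and the example URL are placeholders, and this is not necessarily the exact invocation used for ilo Muni.

```python
import subprocess


def comment_command(url: str, out_dir: str = "archive") -> list[str]:
    """Build a yt-dlp command that saves metadata and comments but no video."""
    return [
        "yt-dlp",
        "--skip-download",    # do not download the video itself
        "--write-comments",   # fetch comments and place them in the info JSON
        "--write-info-json",  # write the metadata to a .info.json file
        "--paths", out_dir,   # directory to write files into
        url,
    ]


def fetch_comments(url: str, out_dir: str = "archive") -> None:
    """Run the command; requires yt-dlp to be installed and on PATH."""
    subprocess.run(comment_command(url, out_dir), check=True)


# Placeholder URL; a playlist or channel URL works the same way.
print(" ".join(comment_command("https://www.youtube.com/watch?v=VIDEO_ID")))
```

Each video then yields one `.info.json` file containing its metadata and comments, which is far smaller than the video itself.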

forums.tokipona.org

To an outsider, it may seem odd to want to include a single specific forum in this data, but this forum was active from 1 Oct 2009 to mid-2020 and was one of the only active spaces during most of that time. As such, it is highly important to the history of toki pona.

I managed to create a backup of this data using wget, specifically its recursive download feature. Normally I’d prefer to have a more stripped down and structured backup, such as having everything packaged into JSON. However, the forum is closed to new accounts, and has received only a dozen posts in the past two years; it will likely remain up as an archive rather than ever become an active space again. Because of that, having a large archive is okay; it won’t get any larger.

Yahoo group

The toki pona Yahoo group was one of very few spaces where toki pona was spoken before 2010. It was active from March 2002 until 1 Oct 2009, and I’m only aware of three other communities that existed at all during this time. The IRC channel could have been more popular, but it wasn’t preserved beyond a handful of specific conversations that I’m aware of. Fortunately, the entire Yahoo group is backed up on the forum above, so including the forum also means including the Yahoo group!

As a fair warning, the formatting of messages in the Yahoo group backup is rough. Many newlines are missing, presumably caused by whatever software was used to copy the data to the forum. For that matter, messages are formatted inconsistently due to the unstable nature of email from provider to provider.

However, my preprocessing in sona mute was more than capable of handling the inconsistent formatting. It would be nice to go back to this data to improve the formatting and squeeze a bit more accuracy out of it as a result, but it’s adequate as-is.

Will you add more communities?

Yes! The main barrier to adding more communities is being able to download data from the given platform. For all of the communities I support, there was a relatively easy way to obtain the messages from the platform or from a pre-existing archive of the platform. For everything I don’t yet support, there isn’t an easy way to download the data from the platform, or a pre-existing archive to download.

Facebook

There are several toki pona communities on Facebook, here, here, here, and here. The majority of their activity is in a period similar to that of Discord- that is, from 2020 onward- but they have much more pre-2020 activity than most other communities that existed around that time. Unfortunately, scraping data from Facebook is extremely difficult- the handful of open source scrapers that exist are variously inconsistent, low quality, or unsupported.

Tumblr

There are a surprising number of toki pona blogs and toki pona posts on Tumblr, which would be excellent to include! During my searching, I identified around 700 blogs that had posted about toki pona at any point. Additionally, toki pona activity on Tumblr goes back as far as 2016, with a major uptick in 2020 and 2021.

However, downloading the data to include it seems extremely difficult. There is a tool called TumblThree which purports to let you download Tumblr and Twitter data, but my experience is that it gets rate limited aggressively by both platforms- even with the slowest settings.

LiveJournal

There are at least two LiveJournal blogs that focused on toki pona, here and here, which were active in a similar time period to the forum or Yahoo group. They’re both small, but anything counts, especially for the history of toki pona before 2016. Unfortunately, all the LiveJournal archiving tools I can find are for improving the personal data export feature, not for scraping the site.

toki.social (Mastodon)

toki.social is a Mastodon instance for toki pona! It’s been around since early 2022. Not much else to say; it’s a lovely place, although it’s fairly quiet.

kulupu.pona.la

kulupu.pona.la was a forum hosted by mazziechai which closed abruptly in November 2023 due to trolling. The forum was archived fully by Mazzie before the shutdown, but the format is pure HTML, making it difficult to get the necessary data out of it.

Will you update the data in the future?

Yes! I plan to update this data at least once per year, but doing so two or three times a year isn’t out of the question. Collecting, parsing, and counting up all of the data is not labor intensive. Writing the code to do all that was, but most of that is done.

That said, there isn’t much value in updating more than once per year. Google Ngrams only updates about every three years. Trends in language don’t generally happen in weeks or months- even at the community’s current size, these trends take years.

What can I do with ilo Muni?

Can I use the data in my project?

Sure! The database is distributed under the terms of the CC BY-SA 4.0 License. In short, you can use the data for anything you like, but you need to attribute me when you do so, and any derivative works made from the data must use the same license. See the linked deed and associated license terms for more details.

Can I cite ilo Muni in my study?

Yes! I recommend including the access date for the graphing tool, because I update the primary dataset periodically with improvements or additions. The dataset is already dated.

Here are some samples in various citation formats:

MLA 9

APA 7

How can I contribute?

You can contribute to ilo Muni here, its preprocessing library here, and its frequency counting library here!

However, code is not the only way to help- if you know of a Toki Pona community that I don’t, and especially if you already have an archive of that data, please email me or open an issue.

Why “ilo Muni?”

Several reasons! I was originally going to name this project “lipu mute”, “multiplicity document.” However, I realized near the end of July that if this tool had a name, I would be able to search its name in this tool! This was too cool to pass up.

From there, choosing a name was easy. I wanted to name it in toki pona, and have the sitelen pona of the name be a phrase which also describes itself. This tool shows how many words there are, and also the frequency of words, which are respectively “mute nimi” and “nimi mute”. Taking the first syllable of each word, you get “Muni” and “Nimu.”
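The syllable trick above can be sketched in a few lines. This assumes toki pona's (C)V(n) syllable shape, where a syllable-final "n" only counts when it is not followed by a vowel. The function names are made up for this example.

```python
VOWELS = set("aeiou")


def first_syllable(word: str) -> str:
    """Return the first (C)V(n) syllable of a toki pona word."""
    word = word.lower()
    # Skip the optional leading consonant to find the first vowel.
    i = 0
    while i < len(word) and word[i] not in VOWELS:
        i += 1
    if i == len(word):
        return word  # no vowel found; return the word unchanged
    end = i + 1
    # Include a final "n" only if it closes the syllable,
    # meaning it is the last letter or is followed by a consonant.
    if end < len(word) and word[end] == "n":
        if end + 1 == len(word) or word[end + 1] not in VOWELS:
            end += 1
    return word[:end]


def name_from_phrase(phrase: str) -> str:
    """Join the first syllable of each word and capitalize the result."""
    return "".join(first_syllable(w) for w in phrase.split()).capitalize()


print(name_from_phrase("mute nimi"))  # Muni
print(name_from_phrase("nimi mute"))  # Nimu
```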

I ran a poll with both, and the results were very close (with a slight preference for Muni!), so I decided to search both names in the version of ilo Muni that existed at that time- “Nimu” had just over 40 results, but “Muni” didn’t come up, so I went with it. Here we are!

Thank you to…