My latest project at Nulab was to migrate over the old Developer Website (containing API documentation, mostly) from Drupal to something new. The old site had become somewhat clumsy to manage, and we needed a new solution.
After initially considering WordPress, we ultimately landed on Hugo, a static site generator (SSG) written in Go. An SSG was perfect for our use case, but migrating the data was non-trivial. But first things first…
As for `hugo` itself: we looked at Jekyll and Hexo, but Hugo won out.
So, again: this site would need to display the API documentation for Nulab's 3 current products, so 3 APIs = 3 datasets. Unfortunately, each dataset was stored differently. The data for one app, Backlog, was written in a proprietary markup format that had to be converted into a form Hugo understood, namely Markdown.
We needed to extract the data from Backlog before anything else. The data was stored as a huge text file (essentially). Backlog's wiki syntax is very similar to Markdown, which means basic conversions to Markdown should be easy to do using JavaScript's native `replace` method with regular expressions.
Backlog Wiki Syntax | Markdown |
---|---|
{code} some code {/code} | `some code` |
'italic text' | *italic text* |
''bold text'' | **bold text** |
[[Google:https://google.com]] | [Google](https://google.com) |
First things first, I had to import some libraries to help me.
const fs = require('fs');
const path = require('path');
const chalk = require('chalk');
const readline = require('readline');
`fs` manages files. `path` makes managing file paths easy. `chalk` allows for colorized messages in the terminal, which can improve readability and aid in debugging. `readline` allows us to read a file line-by-line.
Now, for those regular expression `replace`ments:
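// Convert * (Backlog headings) to # (Markdown headings)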
let result = data.replace(/\*/g, '#');
// Convert {code}...{/code} to fenced code blocks (```)
result = result.replace(/({code})/g, '```');
result = result.replace(/({\/code})/g, '```\n');
// Convert [[a:b]] to [a](b)
result = result.replace(/(\[\[)([^\:]+)\:(([^\s]+))(\]\])/g, '[$2]($3)');
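// Convert &br; to <br>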
result = result.replace(/(&br;)/g, "<br>");
A few more of these later, I was essentially done: Backlog syntax was being converted to Markdown successfully.
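Wrapping those replacements in a quick read-convert-write pass looked roughly like this (a sketch; the file names here are placeholders, not the real paths):

```js
// Sketch: apply the wiki-to-Markdown substitutions to a whole file.
// The input and output paths are placeholders.
const fs = require('fs');

const data = fs.readFileSync('backlog-wiki-source.txt', 'utf8');

let result = data.replace(/\*/g, '#');
// ...the rest of the replace() calls shown above...
result = result.replace(/(&br;)/g, '<br>');

fs.writeFileSync('converted.md', result);
```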
The next step was decidedly harder, but to understand why it was necessary, we first have to review the concept of front matter.
Front matter is a way to include metadata in a Markdown file by having a little bit of `yaml`, `json`, or `toml` at the top of the file. Here's an example in TOML, which is Hugo's default:
title = "Hello World"
author = "Kevin Oh"
and the same in JSON:
{
"title" : "Hello World!",
"author" : "Kevin Oh"
}
If we were going to use Hugo for our site, it meant we needed to convert all of our content to Markdown files and populate each file with metadata relevant to that particular section. For example, let's say we want to create an API reference page for `someFancyAPIAction`. We would need to create a Markdown file, `some-fancy-api-action.md`, with content like so:
+++
title = "Some Fancy API Action"
app = "AwesomeApp"
+++
# Some Fancy API Action
This is where I could discuss the API action/method/endpoint/whatever.
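Generating a block like that programmatically is only a few lines with ES6 template strings, which is how the final script ended up doing it. Here's a rough sketch; the `buildFrontMatter` name and the shape of the `page` object are just illustrative:

```js
// Sketch: build a TOML front-matter block from a page object.
// The `page` shape is illustrative, not the exact object used in the real script.
function buildFrontMatter(page) {
  return [
    '+++',
    `title = "${page.title}"`,
    `app = "${page.app}"`,
    '+++',
    ''
  ].join('\n');
}

const page = { title: 'Some Fancy API Action', app: 'AwesomeApp' };
console.log(buildFrontMatter(page) + '# Some Fancy API Action\n');
```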
Okay, so front matter is a way of easily adding metadata to a file.

Let's recap what needs to be done. We're faced with the following tasks:

- Convert the Backlog wiki syntax to Markdown (done above).
- Split the one huge text file into individual pages.
- Generate the right front matter for each page and prepend it.
- Save each page as its own Markdown file in the correct place.
The `fs` and `readline` modules were a godsend here. But before diving into actual code, I printed out 3 pages of the API documentation to learn its structure and to note any irregular patterns.
I cautiously made an initial assessment: we would need to parse the entire file line-by-line, gradually "building up each page" and then "snipping off" each piece from the larger file by saving it (with the `fs` module).
But then came the wrench in the cogs: these files couldn't just naively be "snipped" and saved in a folder. Information had to be extracted from the text as it was being parsed so that the front matter could be prepended to the top of each page. We weren't just dividing up the text file; we were adding to each piece and then saving it.
The task was now to parse the huge file line-by-line, extract all relevant data, reassemble each page with front matter prepended to the top, and then save it in the correct directory. Quite a different task.
We needed some sort of parser, so I wrote an elementary one. It felt hacked together: a script flush with conditional statements, pointer/tracking variables, etc., but it worked. The algorithm went something like:
- If a line started with `#`, then: a new page had begun, so "snip off" the page built so far and start building a new one.
- Otherwise, keep appending the line to the current page, recording anything useful (the title, whether a "Method" section appeared, and so on) along the way.

We needed to build a semantic understanding of the text. I could easily and neatly track this data by creating a JS object to represent each page, which could then be referenced to build the front matter.
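A heavily simplified sketch of that loop, with the page object boiled down to a few fields (the field names, placeholder path, and heading check are my shorthand, not the exact ones from the real script):

```js
// Sketch of the "build up and snip off" loop.
// The page shape and the input path are simplifications/placeholders.
const fs = require('fs');
const readline = require('readline');

const pages = [];
let currentPage = null;

const rl = readline.createInterface({
  input: fs.createReadStream('big-data-file.md'), // placeholder path
  crlfDelay: Infinity
});

rl.on('line', (line) => {
  if (line.startsWith('# ')) {
    // A heading marks a new page: snip off the one we were building.
    if (currentPage) pages.push(currentPage);
    currentPage = {
      title: line.replace(/^#\s*/, ''),
      isAPI: false,
      lines: [line]
    };
  } else if (currentPage) {
    if (line.includes('Method')) currentPage.isAPI = true;
    currentPage.lines.push(line);
  }
});

rl.on('close', () => {
  if (currentPage) pages.push(currentPage);
  // Each object in `pages` can now be used to build front matter and save a file.
  console.log(`Parsed ${pages.length} pages`);
});
```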
Two days of work later, it worked. I created a script that, when run, takes an input file and an output format (Markdown or JSON) and spits out the data in the new form.
The `title` was extracted from lines with `<h1>`s. The `filename` front matter variable was created by lower-casing the `title`, replacing whitespace with dashes, and removing apostrophes. So, `I'm A Blue Cat` became `im-a-blue-cat.md`. API pages contained a section called "Method", so depending on whether or not a recorded section contained `Method`, I set a boolean `isAPI` variable.
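The filename logic boils down to a tiny helper along these lines (the `toFilename` name is just for illustration):

```js
// Sketch: turn a page title into a Markdown filename.
// toFilename is an illustrative name, not the one in the original script.
function toFilename(title) {
  return title
    .toLowerCase()          // "I'm A Blue Cat" -> "i'm a blue cat"
    .replace(/'/g, '')      // remove apostrophes -> "im a blue cat"
    .replace(/\s+/g, '-')   // whitespace to dashes -> "im-a-blue-cat"
    + '.md';
}

console.log(toFilename("I'm A Blue Cat")); // "im-a-blue-cat.md"
```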
Finally, once more using the `fs` module, I saved each file in the correct place.
Some pages were top-level pages, others were nested. URL remapping was done with a simple configuration JSON object, which was stored locally and pulled in using `require`. Something like this:
// config.json
{
"topLevelPages": [
"home",
"getting-started",
"contact"
],
"urlMap": {
"error-response": "/docs/backlog/error-response",
"version-history": "/docs/backlog/changes",
"oauth-20" : "/docs/backlog/api/2/oauth2"
}
}
// mainscript.js
let config = require('./config');
if (config.topLevelPages.includes(somePageID)) {
// url remapping logic
}
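The remapping logic itself stayed simple. Roughly (a sketch; the fallback path below is an assumption, not the site's actual default):

```js
// Sketch: decide where a page should live using the config object above.
// The default path is an assumption for illustration.
const config = require('./config');

function resolveOutputPath(pageID) {
  if (config.topLevelPages.includes(pageID)) {
    return `/${pageID}`;                  // top-level pages sit at the root
  }
  if (config.urlMap[pageID]) {
    return config.urlMap[pageID];         // explicitly remapped pages
  }
  return `/docs/backlog/api/2/${pageID}`; // assumed default for everything else
}

console.log(resolveOutputPath('error-response')); // "/docs/backlog/error-response"
```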
Watching the script run was beautiful.
`node conversion-script.js md link-to-the-big-data-file desired-output.md` spat out a Markdown file.

`node conversion-script.js json link-to-the-big-data-file desired-output.json` spat out a JSON file.

`node conversion-script.js hugo link-to-the-big-data-file` first converted the data to Markdown, then built an in-memory JSON object, then used that object to generate front matter with ES6 template strings, and finally saved each object as a file.
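Under the hood, that mode switching is just a check on `process.argv`. A rough sketch of the dispatch, with the three conversion steps stubbed out as placeholder functions:

```js
// Sketch: CLI dispatch for conversion-script.js.
// The three functions are stubs standing in for the real conversion steps.
function convertToMarkdown(input, output) { /* wiki-to-Markdown pass */ }
function convertToJSON(input, output) { /* Markdown-to-JSON pass */ }
function buildHugoFiles(input) { /* Markdown + front matter, one file per page */ }

const [, , mode, inputPath, outputPath] = process.argv;

switch (mode) {
  case 'md':
    convertToMarkdown(inputPath, outputPath);
    break;
  case 'json':
    convertToJSON(inputPath, outputPath);
    break;
  case 'hugo':
    buildHugoFiles(inputPath);
    break;
  default:
    console.error('Usage: node conversion-script.js <md|json|hugo> <input> [output]');
    process.exit(1);
}
```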
Hugo is smart enough to pick up translations based on file extensions. So, `cats.md` is the English version of a page, while `cats.ja.md` would be the Japanese translation.
There was another huge file for the Japanese version of the API documentation. The English version of the script had certain keywords - markers of sorts - that, when detected, would indicate that a new section had started or ended. My naive approach was to simply swap out the English keywords for the Japanese keywords. However, I immediately noticed this wouldn't work, because the URLs had to remain in English.
I was able to safely make certain assumptions that made this task infinitely easier. Most notably, I was able to assume that the ordering of the dataset was the same in both the English and Japanese versions.
This allowed me to take an expedient, gimme-dat-tech-debt approach, since we were already crunched for time on this project. I duplicated the script I had made to parse the English version and stripped it down to the barebones parsing logic. I then extracted all the `#` and `##` lines (i.e., all headings and subheadings) and saved them in two separate files: `headings.json` and `subheadings.json`.
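That stripped-down extraction pass is only a few lines. A sketch (the input path is a placeholder):

```js
// Sketch: pull every heading and subheading out of the English file.
// The input path is a placeholder for the real data file.
const fs = require('fs');

const lines = fs.readFileSync('english-big-data-file.md', 'utf8').split('\n');

const headings = lines.filter((line) => /^#\s/.test(line));
const subheadings = lines.filter((line) => /^##\s/.test(line));

fs.writeFileSync('headings.json', JSON.stringify(headings, null, 2));
fs.writeFileSync('subheadings.json', JSON.stringify(subheadings, null, 2));
```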
I then created a third, duplicated version of my script to process the Japanese data and `require`d the two JSON files above.
However, instead of extracting headings from the text like the English version did, the Japanese version of the script would "pop" a new value off of the headings array and subheadings array whenever it detected that the next heading or subheading should be saved. The headings would be reformatted (as mentioned earlier) to become filenames, which were then tweaked: `cat.md` became `cat.ja.md`.
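In other words, something along these lines (a sketch; the `shift`-based popping and the helper names are mine, not the original script's):

```js
// Sketch: reuse the English headings while parsing the Japanese file.
// headings.json / subheadings.json come from the extraction step above.
const headings = require('./headings.json');
const subheadings = require('./subheadings.json');

function nextEnglishHeading(isSubheading) {
  // "Pop" the next value off the front of the matching array, relying on the
  // assumption that both datasets are in the same order.
  return (isSubheading ? subheadings : headings).shift();
}

function toJapaneseFilename(englishHeading) {
  const title = englishHeading.replace(/^#+\s*/, '');
  return title.toLowerCase().replace(/'/g, '').replace(/\s+/g, '-') + '.ja.md';
}

console.log(toJapaneseFilename(nextEnglishHeading(false))); // e.g. "cat.ja.md"
```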
It worked like a charm! 🎉
The Japanese files populated beautifully with English filenames (which were also used to map their URL paths).
About 90% of the files converted with no flaws, but there were inevitably some hiccups. I went through manually and fixed many of the errors that had simply gone under the radar. It was irritating, but the time spent on automating all of the rest of the data extraction was well worth it, considering there were over 130 API items.
Plus - we now have a CLI tool to convert from our proprietary syntax to Markdown and JSON - sweet.
The finished site is available here.