Multi- to Monorepo Migration

This script merges multiple independent tiny repositories into a single “monorepo”. The summary is “every repo is moved into its own subdirectory, branches are merged.” See Example for exactly how this works.

Further reading: https://syslog.ravelin.com/multi-to-mono-repository-c81d004df3ce

$ cat my-repos.txt
git@github.com:mycompany/my-repo-abc.git abc
git@github.com:mycompany/my-repo-def.git def
$ /path/to/tomono < my-repos.txt

... noise noise noise

$ cd core # your monorepo is here now

Features

🕙 Full history of all your prior repos is intact, no changes to checksums
#️⃣ Signatures of old repos stay valid
🔁 Create the monorepo and keep pulling in changes from your minirepos later
🔀 Pull in entire new repos as you go, no need to prepare the whole thing at once
🏷 Tags are namespaced to avoid clashes, but tag signatures remain valid
🉑 Branches with weird names (slashes, etc)
👥 No conflicts between files with the same name
📁 Every project gets its own subdirectory

Usage

Run the tomono script with your config on stdin, in the following format:

$ cat my-repos.txt
git@github.com:mycompany/my-repo-abc.git  abc
git@github.com:mycompany/my-repo-def.git  def
git@github.com:mycompany/my-lib-uuu.git   uuu  lib/uuu
git@github.com:mycompany/my-lib-zzz.git   zzz  lib/zzz

Concrete example:

$ cat my-repos.txt | /path/to/tomono

That should be all ✅.

Custom name for monorepo directory

Don’t like core? Set a different name through an envvar before running the script:

export MONOREPO_NAME=the-big-repo

Custom “master” / “main” branch name

No need to do anything. This script does not handle any master / main branch in any special way. It just merges whatever branches exist. Don’t have a “master” branch? None will be created.

Make sure your own computer has the right branch set up in its init.defaultBranch setting.

Continue existing migration

Large teams can’t afford to “stop the world” while a migration is in progress. You’ll be fixing stuff and pulling in new repositories as you go.

Here’s how to pull in an entirely new set of repositories:

/path/to/tomono --continue < my-new-repos.txt

Make sure you have your environment set up exactly the same as above. Particularly, you must be in the parent dir of the monorepo.

Example

Run these commands to set up a fresh directory with git monorepos that you can later merge:

Initial setup of fake repos

d="$(mktemp -d)"
echo "Setting up fresh multi-repos in $d"
cd "$d"

mkdir foo
(
    cd foo
    git init
    git commit -m "foo’s empty root" --allow-empty
    echo "This is foo" > i-am-foo.txt
    git add -A
    git commit -m "foo’s master"
    git tag v1.0
    git checkout -b branch-a
    echo "I am a new foo feature" > feature-a.txt
    git add -A
    git commit -m "foo’s feature branch A"
)

mkdir bar
(
    cd bar
    git init
    echo "This is bar" > i-am-bar.txt
    git add -A
    git commit -m "bar’s master"
    git tag v1.0
    git checkout -b branch-a
    echo "I am bar’s side of feature A" > feature-a.txt
    git add -A
    git commit -m "bar’s feature branch A"
    git branch branch-b master
    git checkout branch-b
    echo "I am an entirely new feature of bar: B" > feature-b.txt
    git add -A
    git commit -m "bar’s feature branch B"
)

You now have two directories:

foo (branches: master, branch-a)
bar (branches: master, branch-a, branch-b)

Combine into monorepo

Assuming the tomono script is in your $PATH, you can invoke it like this, from that same directory:

tomono <<EOF
$PWD/foo foo
$PWD/bar bar
EOF

This will create a new directory, core, where you can find a git tree which looks somewhat like this:

*   Merge foo/branch-a (branch-a)
|\
| * foo’s feature branch A (foo/branch-a)
* |   Merge bar/branch-a
|\ \
| * | bar’s feature branch A (bar/branch-a)
* | | Root commit for monorepo branch branch-a
 / /
| | *   Merge foo/master (HEAD -> master)
| | |\
| | |/
| |/|
| * | foo’s master (tag: foo/v1.0, foo/master)
| * | foo’s empty root
|  /
| *   Merge bar/master
| |\
| |/
|/|
| * Root commit for monorepo branch master
|
| *   Merge bar/branch-b (branch-b)
| |\
| | * bar’s feature branch B (bar/branch-b)
| |/
|/|
* | bar’s master (tag: bar/v1.0, bar/master)
 /
* Root commit for monorepo branch branch-b

Pull in new changes from a remote

It’s possible that while you’re working on setting up your fresh monorepo, new changes have been pushed to the existing single repos:

(
	cd foo
	echo New changes >> i-am-foo.txt
	git commit -va -m 'New changes to foo'
)

Because their history was imported verbatim and nothing has been rewritten, you can import those changes into the monorepo.

First, fetch the changes from the remote:

$ cd core
$ git fetch foo

Now merge your changes using subtree merge:

git checkout master
git merge -X subtree=foo/ foo/master

And the updates should be reflected in the monorepo:

$ cat foo/i-am-foo.txt
This is foo
New changes

I used the branch master in this example, but any branch works the same way.

Continue

Now imagine you want to pull in a third repository into the monorepo:

mkdir zimlib
(
    cd zimlib
    git init
    echo "This is zim" > i-am-zim.txt
    git add -A
    git commit -m "zim’s master"
    git checkout -b branch-a
    echo "I am a new zim feature" > feature-a.txt
    git add -A
    git commit -m "zim’s feature branch A"
    # And some more weird stuff, to mess with you
    git checkout master
    git checkout -d
    echo top secret > james-bond.txt
    git add -A
    git commit -m "I am unreachable"
    git tag leaking-you HEAD
    git checkout --orphan empty-branch
    git rm --cached -r .
    git clean -dfx
    git commit -m "zim’s tricky empty orphan branch" --allow-empty
)

Continue importing it:

echo "$PWD/zimlib zim lib/zim" | tomono --continue

Note that we used a different name for this subrepo, inside the lib dir.

The result is that it gets imported into the existing monorepo, alongside the existing two projects:

$ cd core
$ git checkout master
Switched to branch 'master'
$ tree
.
├── bar
│   └── i-am-bar.txt
├── foo
│   └── i-am-foo.txt
└── lib
    └── zim
        └── i-am-zim.txt

4 directories, 3 files
$ git checkout branch-a
Switched to branch 'branch-a'
$ tree
.
├── bar
│   ├── feature-a.txt
│   └── i-am-bar.txt
├── foo
│   ├── feature-a.txt
│   └── i-am-foo.txt
└── lib
    └── zim
        ├── feature-a.txt
        └── i-am-zim.txt

4 directories, 6 files
$ head **/feature-a.txt
==> bar/feature-a.txt <==
I am bar’s side of feature A

==> foo/feature-a.txt <==
I am a new foo feature

==> lib/zim/feature-a.txt <==
I am a new zim feature

Implementation

(This section is best viewed at https://tomono.0brg.net/, the GitHub Readme viewer misses some info)

The outer program structure is a flat bash script which loops over every repo supplied over stdin:

<<init>>

# Note this is top-level in the script so it’s reading from the script’s stdin
while <<windows-fix>> read repourl reponame repopath; do
    if [[ -z "$repopath" ]]; then
        repopath="$reponame"
    fi

    <<handle-remote>>
done

<<finalize>>

# <<copyright>>

Per repository

Every repository is fetched and fully handled individually, and sequentially:

fetch all the data related to this repository,
immediately check out and initialise every single branch which belongs to that repository.

git remote add "$reponame" "$repourl"
git config --add "remote.$reponame.fetch" "+refs/tags/*:refs/tags/$reponame/*"
git config "remote.$reponame.tagOpt" --no-tags
git fetch --atomic "$reponame"

<<list-branches>> | while read branch ; do
    <<handle-branch>>
done

The remotes are configured to make sure that a default fetch always fetch all tags, and also puts them in their own namespace. The default refspec for tags is +refs/tags/*:refs/tags/*, as you can see that puts everything from the remote at the same level in your monorepo. Obviously that will cause clashes, so we add the reponame as an extra namespace.

The --no-tags option is the complement to --tags, which has that default refspec we don’t want. That’s why we disable it and roll our own, entirely.

Per branch (this is where the magic happens)

In the context of a single repository check-out, every branch is independently checked out into a subdir for that repository, and merged into the monorepo.

This is the money shot.

<<ensure-on-target-branch-in-monorepo>>

git merge --strategy=ours "$reponame/$branch" --allow-unrelated-histories --no-commit --no-ff
git read-tree --prefix "$repopath" "$reponame/$branch"
git commit -m "Merge $reponame/$branch" --allow-empty

Source: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

Technically, it’s not actually “checked out”: that implies changing the files in your work tree, i.e. on disk. Rather only the index is changed.

Ensure we are on the right branch

In this snippet, we ensure that we are ready to merge fresh code from a subrepo into this branch: either we checkout an existing branch in the monorepo by this name, or we create a fresh one.

We are given the variable $branch which is the final name of the branch we want to operate on. It is the same as the name of the branch in each individual target repo.

if ! git show-ref --verify --quiet "refs/heads/$branch"; then
    git branch -- "$branch" "$(printf "Root commit for monorepo branch %s" "$branch" | git commit-tree "$empty_tree")"
fi
git symbolic-ref HEAD "refs/heads/$branch"
git reset

Instead of using git checkout --orphan and trying to create a new empty commit from the index, we create the empty commit first, and update the HEAD to it directly. This lets us stay entirely in the index without bothering with the work tree.

Sources:

Non-goal: merging into root

GitHub user @woopla proposed in #42 the ability to merge a minirepo into the monorepo root, as if you used . as the subdirectory. We ended up not going for it, but it was interesting to investigate how to do this with git read-tree. The closest I got was:

if [[ "$repopath" == "." ]]; then
    # Experimental—is this how git read-tree works? I find it very confusing.
    git read-tree "$branch" "$reponame/$branch"
else
    git read-tree --prefix "$repopath" "$reponame/$branch"
fi

I must to confess I find the git read-tree man page too daunting to fully stand by this. I mostly figured it out by trial and error. It seems to work?

If anyone could explain to me exactly what this tool is supposed to do, what those separate stages are (it talks about “stage 0” to “stage 3” in its 3 way merge), and how you would cleanly do this, just for argument’s sake, I’d love to know.

But, as it turned out, this tool already has a way to merge a repo into the root: just make it the monorepo, and use it as a target for a --continue operation. That solves that.

Set up the monorepo directory

We create a fresh directory for this script to run in, or continue on an existing one if the --continue flag is passed.

# Poor man’s arg parse :/
arg="${1-}"
: "${MONOREPO_NAME:=core}"

if [[ "$arg" == "" ]]; then
	if [[ -d "$MONOREPO_NAME" && "$arg" != "--continue" ]]; then
		>&2 echo "monorepo directory $MONOREPO_NAME already exists"
		exit 1
	fi
	mkdir "$MONOREPO_NAME"
	cd "$MONOREPO_NAME"
	git init
elif [[ "$arg" != "--continue" ]]; then
	>&2 echo "Unexpected argument: $arg"
	>&2 echo
	>&2 echo "Usage: $0 [--continue]"
	exit 1
elif [[ ! -d "$MONOREPO_NAME" ]]; then
	>&2 echo "Asked to --continue, but monorepo directory $MONOREPO_NAME doesn’t exist"
	exit 1
else
	cd "$MONOREPO_NAME"
	if git status --porcelain | grep . ; then
		>&2 echo "Git status shows pending changes in the repo. Cannot --continue."
		exit 1
	fi
	# There isn’t anything special about --continue, really.
fi

Most of this rigmarole is about UI, and preventing mistakes. As you can see, there is functionally no difference between continuing and starting fresh, beyond mkdir and git init. At the end of the day, every repo is read in greedily, and whether you do that on an existing monorepo, or a fresh one, doesn’t matter: every repo name you read in, is in fact itself like a --continue operation.

It’s horrible and kludgy but I just want to get something working out the door, for now.

List individual branches

I want a single branch name per line on stdout, for a single specific remote:

git branch -r --no-color --list "$reponame/*" --format "%(refname:lstrip=3)"

Implementations that didn’t make the cut

Solutions I abandoned, due to one short-coming or another:

`git branch -r` with grep

The most straight-forward way to list branch names:

$ git branch -r
  bar/branch-a
  bar/branch-b
  bar/master
  foo/branch-a
  foo/master

This could be combined with grep to filter all branches for a specific remote, and filter out the name. It’s very close, but how do you reliably remove an unknown string?

`find .git/refs/hooks`

( cd ".git/refs/remotes/$reponame" && find . -type f -mindepth 1 | sed -e s/..// )

Closer, but ugly, and I got reports that it missed some branches (although I was never able to repro)

`git ls-remote`

git ls-remote --heads --refs "$reponame" | sed 's_[^ ]* *refs/heads/__'

Originally suggested in a PR 39, I’ve decided not to use this because git-ls-remote actively queries the remote to list its branches, rather than inspecting the local state of whatever we just fetched. That feels like a race condition at best, and becomes very annoying if you’re dealing with password protected remotes or otherwise inaccessible repos.

Init & finalize

Initialization is what you’d expect from a shell script:

<<set-flags>>

<<prep-dir>>

empty_tree="$(git hash-object -t tree /dev/null)"

On the other side, when done, update the working tree to whatever the current branch is to avoid any confusion:

git checkout .

Error flags, warnings, debug

Various sh flags allow us to control the behaviour of the shell: treat any unknown variable reference as an error, treat any non-zero exit status in a pipeline as an error (instead of only looking at the last program), and treat any error as fatal and quit. Additionally, if the DEBUGSH environment variable is set, enable “debug” mode by echoing every command before it gets executed.

set -euo pipefail ${DEBUGSH+-x}

Windows newline fix

On Windows the config file could contain windows newline endings (CRLF). Bash doesn’t handle those as proper field separators. Even on Windows…

We force it by adding CR as a field separator:

IFS=$'\r'"$IFS"

It can’t hurt to do this on other computers, because who has a carriage return in their repo name or path? Nobody does.

The real question is: why is this not standard in Bash for Windows? Who knows. I’d add it to my .bashrc if I were you 🤷‍♀️.

Building the code

The easiest way to build everything in this repo is using docker:

docker-compose run --rm build

Most of the code in this repository is generated from this readme file. This can be done in stock Emacs, by opening this file and calling M-x org-babel-tangle.

This file can also be exported to HTML. Executing the block below, before you export it, adds some extra flourish to that exported file:

;; This is configuration for org mode itself, not tomono src code. Don't export this.

;; TODO: Clean this up. No globals etc.

(require 'cl-lib)
(require 'dash)
(require 's)

(defun org-info-name (info)
  (nth 4 info))

(defun insert-ln (&rest args)
  (apply #'insert args)
  (newline))

(defun should-reference (info)
  "Determine if this info block is a referencing code block"
  (not (memq (alist-get :noweb (nth 2 info))
             '(nil "no"))))

(defun re-findall (re str &optional offset)
  "Find all matches of a regex in the given string"
  (let ((start (string-match re str offset))
        (end (match-end 0)))
    (when (numberp start)
      (cons (substring str start end) (re-findall re str end)))))

;; Match groups are the perfect tool to achieve this but EL's regex is
;; inferior and it's not worth the hassle. Blag it manually.

(defun strip-delimiters (s prefix suffix)
  "Strip a prefix and suffix delimiter, e.g.:
(strip-delimiters \"<a>\" \"<\" \">\")
=> \"a\"

Note this function trusts the input string has those delimiters"
  (substring s (length prefix) (- (length suffix))))

(defun strip-noweb-delimiters (s)
  "Strip the org noweb link delimiters, usually << and >>"
  (strip-delimiters s org-babel-noweb-wrap-start org-babel-noweb-wrap-end))

(defun extract-refs (body)
  (mapcar #'strip-noweb-delimiters (re-findall (org-babel-noweb-wrap) body)))

(defun add-to-hash-list (k elem hash)
  "Assuming the hash values are lists, add this element to k's list"
  (puthash k (cons elem (gethash k hash)) hash))

(defun register-refs (name refs)
  (puthash name refs forward-refs)
  ;; Add a backreference to every ref
  (mapc (lambda (ref)
          (add-to-hash-list ref name back-refs))
        refs))

(defun parse-blocks ()
  (let ((forward-refs (make-hash-table :test 'equal))
        (back-refs (make-hash-table :test 'equal)))
    (org-babel-map-src-blocks nil
      ;; Probably not v efficient, but should be memoized anyway?
      (let* ((info (org-babel-get-src-block-info full-block))
             (name (org-info-name info)))
        (when (and name (should-reference info))
          (register-refs name (extract-refs body)))))
    (list forward-refs back-refs)))

(defun tomono--format-ref (ref)
  (format "[[%s][%s]]" ref ref))

(defun insert-references-block (info title refs)
  (when refs
    (insert title)
    (->> refs (mapcar 'tomono--format-ref) (s-join ", ") insert-ln)
    (newline)))

(defun insert-references (info forward back)
  (when (or forward back)
    (newline)
    (insert-ln ":REFERENCES:")
    (insert-references-block info "References: " forward)
    (insert-references-block info "Used by: " back)
    (insert-ln ":END:")))

(defun get-references (name)
  (list (gethash name forward-refs) (gethash name back-refs)))

(defun fix-references (backend)
  "Append a references section to every noweb codeblock"
  (cl-destructuring-bind (forward-refs back-refs) (parse-blocks)
    (org-babel-map-src-blocks nil
      (let ((info (org-babel-get-src-block-info full-block)))
        (when (should-reference info)
          (pcase-let ((`(,language ,body ,arguments ,switches ,name ,start ,coderef) info))
            (goto-char end-block)
            (apply #'insert-references info (get-references name))))))))

(add-hook 'org-export-before-parsing-hook 'fix-references nil t)

;; The HTML output
(let ((org-html-htmlize-output-type 'css))
  (org-html-export-to-html))

Tests

The examples can be combined into a test script:

set -xeuo pipefail
shopt -s globstar

# The tomono script is tangled right next to the test script
export PATH="$PWD:$PATH"

<<test-setup>>
<<test-run>>
<<test-evaluate>>

All we need is to write the code that actually evaluates the tests and fixtures:

cd core

echo "Checking branch list"
diff -u <(git branch --no-color --list --format "%(refname:lstrip=2)" | sort) <(cat <<EOF
branch-a
branch-b
empty-branch
master
EOF
)

echo "Checking master"
git checkout master
diff -u <(head **/*.*) <(cat <<EOF
==> bar/i-am-bar.txt <==
This is bar

==> foo/i-am-foo.txt <==
This is foo

==> lib/zim/i-am-zim.txt <==
This is zim
EOF
)

echo "Checking branch-a"
git checkout branch-a
diff -u <(head **/*.*) <(cat <<EOF
==> bar/feature-a.txt <==
I am bar’s side of feature A

==> bar/i-am-bar.txt <==
This is bar

==> foo/feature-a.txt <==
I am a new foo feature

==> foo/i-am-foo.txt <==
This is foo

==> lib/zim/feature-a.txt <==
I am a new zim feature

==> lib/zim/i-am-zim.txt <==
This is zim
EOF
)

I use that weird diff -u <(..) trick instead of a string compare like [[ "foo" = “…” ]]=, because the diff shows you where the problem is, instead of just failing the test without comment.

Copyright and license

This is a cleanroom reimplementation of the tomono.sh script, originally written with copyright assigned to Ravelin Ltd., a UK fraud detection company. There were some questions around licensing, and it was unclear how to go forward with maintenance of this project given its dispersed copyright, so I went ahead and rewrote the entire thing for a fresh start.

The license and copyright attribution of this entire document can now be set:

Copyright © 2020, 2022 Hraban Luyat

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program.  If not, see <https://www.gnu.org/licenses/>.

I did not look at the original implementation at all while developing this.

mirrors_ANYbotics/tomono

Multi- to Monorepo Migration

Features

Usage

Custom name for monorepo directory

Custom “master” / “main” branch name

Continue existing migration

Tags

Lightweight vs. Annotated Tags

Example

Initial setup of fake repos

Combine into monorepo

Pull in new changes from a remote

Continue

Implementation

Per repository

Per branch (this is where the magic happens)

Ensure we are on the right branch

Non-goal: merging into root

Set up the monorepo directory

List individual branches

Implementations that didn’t make the cut

`git branch -r` with grep

`find .git/refs/hooks`

`git ls-remote`

Init & finalize

Error flags, warnings, debug

Windows newline fix

Building the code

Tests

Copyright and license

简介

发行版

贡献者

近期动态

mirrors_ANYbotics/tomono

Multi- to Monorepo Migration

Features

Usage

Custom name for monorepo directory

Custom “master” / “main” branch name

Continue existing migration

Tags

Lightweight vs. Annotated Tags

Example

Initial setup of fake repos

Combine into monorepo

Pull in new changes from a remote

Continue

Implementation

Per repository

Per branch (this is where the magic happens)

Ensure we are on the right branch

Non-goal: merging into root

Set up the monorepo directory

List individual branches

Implementations that didn’t make the cut

git branch -r with grep

find .git/refs/hooks

git ls-remote

Init & finalize

Error flags, warnings, debug

Windows newline fix

Building the code

Tests

Copyright and license

简介

发行版

贡献者

近期动态

`git branch -r` with grep

`find .git/refs/hooks`

`git ls-remote`