As I'm always dissatisfied with existing editors, a project I always wanted to start is my own text editor. However doing text editing is serious business.
Besides analyzing the source code of existing text editors, is there any book or other resource (like academic work) about this topic? I'm interested especially in something that teaches how to handle memory and how to manage text insertion (if you have a 100 MB file and want to add a char at x position, you can't just memmove
the huge text block...).
A text editor is a type of computer program that edits plain text. In Windows, such programs are sometimes known as "notepad" software, following the naming of Microsoft Notepad.
Notepad, Wordpad are some of the common editors used on Windows OS and vi, emacs, Jed, pico are the editors on UNIX OS. Features normally associated with text editors are — moving the cursor, deleting, replacing, pasting, finding, finding and replacing, saving etc.
Typical utilities include such programs as shells, text editors, compilers, and (sometimes) the file system.
Take a look at Rob Pike's description of his Sam text editor. Be sure to browse past the high-level overview and command language. He describes parsing, memory management, and data structures later in the paper.
In addition, take a look at Russ Cox's simple regular expression implementation. It's easy to follow and may open some doors outside existing regular expression libraries.
Over the years I've written quite a number of different text editors. Certainly the simplest way is to manage a long sequence of characters, where you copy everything around to insert any character. Other techniques that I've used include:
Many of the old Borland example books used a text editor as a tutorial example. You can occasionally still find copies of these at used bookstores for nearly free.
There's an excellent tutorial available here that covers a lot of relevant topics in a more modern context:
The other answers to this question cover gap buffer.
Another modern coverage is the descriptions of AvalonEdit
and extra detail from:
and there's a huge amount of detail/content (about SharpDevelop) in the book:
Promoted to answer by request:
The antique "Software Tools in Pascal" by Kernighan and Plaugher implements the ed
editor in a language with neither real strings nor pointers. It contains a great overview of the design considerations that underlie any text editor.
One old method that still works is called a gap buffer. The basic idea is that you put the text into a buffer, but instead of putting it all in one block, you create a "gap" at the cursor, putting all the text before the cursor at the beginning of the buffer, and all the text after the cursor at the end of the buffer. Most insertions take place at the cursor, which you can do without moving anything (until or unless you overflow the buffer). When the user moves the cursor, you move the appropriate text from one side of the split to the other.
Given typical controls (cursor left, right, up, down, page up, page down), the largest move you typically deal with is a page at a time, which is typically easy to handle quite a bit faster than a keyboard repeats. It can, of course, slow down a bit if you have a truly huge file and a "goto line" command, or something on that order. If you're going to do a lot of that, there are undoubtedly better structures to use...
The Craft of Text Editing by Craig Finseth, based on his master's thesis, covers these topics. It's free on the web. OTOH it's pretty old and doesn't mention some ideas like ropes that were less practical on the tiny computers of yesteryear.
This paper compares many data structures that can be used for text editors, including some already mentioned here (e.g., gap buffers) as well as several others (e.g., piece tables). The article is old, but I believe it still covers all the major choices, and it nicely compares them in terms of performance, complexity, and space overhead.
I know Stack Overflow answers aren't just supposed to be links, but this is still the best source of information I've ever found for the information requested, and it's far too long to summarize in an answer here. In case the link goes stale, search for "Data Structures for Text Sequences" by Charles Crowley of the University of New Mexico.
The Scintilla component uses a split buffer, along the theory explained in a text linked in their Scintilla and SciTE Related Sites page.
The linked page is Data Structures in a Bit-Mapped Text Editor.
The split buffer proved it works well even with megabyte files. Using secondary structures (eg. a list of line starts) can help too.
Here's how "the pros" at Microsoft do it:
MS Word uses the piece table data structure. An ascending array of character positions is maintained in parallel with an array of larger structures that contain pointers into either the original file (text loaded on file load) or into an append-only buffer of newly added characters.
The EDIT control used everywhere in Windows uses no data structure at all. Notepad simply uses a multi-line EDIT control. Check it out with a heap viewer. The EDIT control maintains only a buffer of line-breaks and tab-stops.
If you're going to create a simple non-formatted text editor in Windows, you could easily subclass the EDIT control to add features.
how to manage text insertion (if you have a 100 MB file and want to add a char at x position, you can't just memove the huge text block...).
Make the text file a linked list, where every line is an entry.
Well if you know that in general people will have relatively few insert points, you could hold an array of pointers into your original text buffer and when the user tries to insert inside it, you "split" the buffer by spawning another pointer to the rest of the buffer, making the first pointer's length so it stops at the insertion point and adding a 3rd pointer for the text to be inserted inbetween, kinda like:
long original text la la la
^ *^
| 2nd part
1st part
And the star points into a new text buffer where you start adding the text to be inserted.
When you render (or in general parse) your text file, you loop over the array of pointers and then do your work on each buffer. Of course if the buffer is small enough, you skip the adding a new buffer part, but that's just heuristics, try each and get a feel for which works best.
You could also consider splitting the text file on load into multiple buffers say every 1MB or so, because if you load the file in a single buffer you WILL need to make a new buffer for inserted text due to the size. Again, this is a heuristic optimization.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With