A Comment-Preserving Pretty-Printer

Posted on 2022-11-17 by Theophilos Giannakopoulos

The goal is to implement a comment-preserving autoformatter for a lambda calculus language using s-expressions. Without comments, the grammar of the languages is (with v as a metavariable for identifiers and n as a metavariable for literal integers):

v ∈ Vars
n ∈ Integers
e ::= v           : variable
    | n           : literal integer
    | ( fun v e ) : abstraction
    | ( e e )     : application

In our little language comments begin with a semicolon (;) character and extend to the end of the line. Comments can occur anywhere whitespace can occur.

For example, the following would be a valid program in our little language:

; Definition of the Y combinator
(fun f ; function to make recursive

  ; remember, 'fun' is 'lambda'
    ((fun x
  (f (x
x)
           ))
(fun x (f (x x)))))

After pretty-printing (with a target of 25 characters per line), it will look like:

; Definition of the Y combinator
(fun f ; function to make recursive
   ; remember, 'fun' is 'lambda'
   ((fun x (f (x x)))
    (fun x (f (x x)))))

There are two steps involved in auto-formatting the language:

parsing in a way that preserves comment information, and
rendering the syntax tree.

First, we have our language extensions and imports. We’ll be using the following libraries in this implementation:

text,
megaparsec,
parser-combinators,
parsers, and
prettyprinter.

{-# LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}

module PrettyPrintWithComments where

import Control.Applicative (many)
import Data.Foldable (asum)
import Data.Functor (void)
import Data.List (intersperse)
import Data.List.NonEmpty (NonEmpty, some1)
import Data.Maybe (catMaybes, isNothing)
import Data.Text (Text)
import Data.Text qualified as Text
import Data.Void (Void)
import Prettyprinter (Doc, Pretty (pretty))
import Prettyprinter qualified as PP
import Text.Megaparsec qualified as P
import Text.Megaparsec.Char qualified as P
import Text.Megaparsec.Char.Lexer qualified as Lexer
import Text.Parser.Combinators qualified as P (skipOptional)

Preserving Comments While Parsing

What information about the comments we need to put the syntax tree depends on what about the comments we want to preserve while formatting the code. In this example, we want to keep comments on the same line as some code on that line, and we want to maintain separation between blocks of comments that are on their own lines and separated by empty lines.

newtype Comment = Comment
    { -- | The content of the comment, not including the leading comment
      -- character
      commentContent :: Text
    }
    deriving (Show)

data CommentSection = CommentSection
    { -- | The comment on the same line as the lexeme, if there is one
      commentSameLine :: Maybe Comment
    , -- | Blocks of comments following lexemes
      commentBlocks :: [NonEmpty Comment]
    }
    deriving (Show)

If we wanted to preserve user-inserted line breaks or otherwise use them as hints for formatting, we would also need to store where the line breaks occur.

We’ll be using a parser combinator library. The typical way to use a parser combinator is to first define the combinators for dealing with lexemes, and then use those to define the parsers for syntactic elements. Whitespace is handled by having lexeme parsers consume and discard all trailing whitespace.

Because we want to preserve comment information, instead of having the whitespace parser discard all of its information, we will have it preserve the comment information that we are interested in.

When we parse a single line of a comment, we will consume the trailing whitespace from the concluding newline through either the end of the next line (not including the newline) or the first non-whitespace character, whichever comes first.

type Parser = P.Parsec Void Text

commentP :: Parser Comment
commentP = do
    void $ P.char ';'
    content <- P.takeWhileP (Just "character") (/= '\n')
    P.skipOptional P.newline <* P.skipMany P.hspace1
    pure $ Comment content

A block of comments is just one or more single comment lines.

commentBlockP :: Parser (NonEmpty Comment)
commentBlockP = some1 commentP

And comment blocks can be separated by lines of only whitespace.

commentBlocksP :: Parser [NonEmpty Comment]
commentBlocksP = fmap catMaybes . many . P.choice $
    [ Just <$> commentBlockP
    , Nothing <$ P.space1
    ]

Given these definitions, we make a new definition of space that serves the same purpose as the one from Megaparse.Char.Lexer, except that it also produces the comments it encountered while consuming whitespace.

space :: Parser CommentSection
space = P.hidden $ do
    P.skipMany P.hspace1
    sameLineC <- P.optional commentP
    cs <- commentBlocksP
    pure $ CommentSection sameLineC cs

We define a similar variants of lexeme and symbol. Since token parsers consume trailing whitespace, parsing a token produces the comments included in that trailing whitespace.

lexeme :: Parser a -> Parser (a, CommentSection)
lexeme p = do
    x <- p
    cs <- space
    pure (x, cs)

symbol :: Text -> Parser CommentSection
symbol = fmap snd . lexeme . P.string

With that, we define parsers for the tokens in our language.

lparenP :: Parser CommentSection
lparenP = symbol "("

rparenP :: Parser CommentSection
rparenP = symbol ")"

funP :: Parser CommentSection
funP = symbol "fun"

dotP :: Parser CommentSection
dotP = symbol "fun"

colonP :: Parser CommentSection
colonP = symbol "::"

identifierP :: Parser (Text, CommentSection)
identifierP = lexeme (fst <$> P.match (P.letterChar *> many P.alphaNumChar))

literalP :: Parser (Integer, CommentSection)
literalP = lexeme Lexer.decimal

The definition of the AST is standard, except for one field on each of the Abs and App constructors which are used to capture information about the comments that appear within those syntactic constructs.

Despite including the comments that follow a lexical token with the parsing of that token, when representing the AST it is easier to work only with the comments contained within the syntactic element (in part due to how the prettyprinter library handles indentation and newlines, which we’ll explain later).

The inclusion only of the comments that occur within the syntactic elements is why the Lit and Var constructors have no comment field.

For AbsComments and AppComments, the comment fields is named after the preceding syntactic element. Because only the comments contained within the overall syntactic construct are included, there is an lparenComments field, but no rparenComments field. The fields that would trail the right parenthesis are instead included in the containing construct.

type Var = Text

data Exp
    = Lit Integer
    | Var Var
    | Abs AbsComments Var Exp
    | App AppComments Exp Exp
    deriving (Show)

data AbsComments = AbsComments
    { lparenComments :: CommentSection
    , keywordComments :: CommentSection
    , boundNameComments :: CommentSection
    , bodyComments :: CommentSection
    }
    deriving (Show)

data AppComments = AppComments
    { lparenComments :: CommentSection
    , functionComments :: CommentSection
    , argumentComments :: CommentSection
    }
    deriving (Show)

The overall program starts at the beginning of the input and goes to the end of the input, and so includes all leading and trailing comments outside of the expressions being parsed. Leading comments can’t occur on the same line as some preceding code, because there is no preceding code, so the leading comments are just a list of Comment blocks instead of a CommentSection.

data Program = Program [NonEmpty Comment] Exp CommentSection
    deriving (Show)

As with the parsers for the lexical tokens, the parsers for the syntactic elements produce both the parsed value and the trailing comment. With the literal and variable parsers, this is straightforward: the parsers just wrap the values with the appropriate constructors.

litP :: Parser (Exp, CommentSection)
litP = do
    (n, c) <- literalP
    pure (Lit n, c)

varP :: Parser (Exp, CommentSection)
varP = do
    (v, c) <- identifierP
    pure $ (Var v, c)

For abstraction and application, we first define the parsers for the content within the parentheses. Because we aren’t parsing the parentheses here, we don’t have the comment following the left parenthesis and so must take it as an argument. We also don’t need to return an extra comment, because the comment trailing the last token parsed is still within the construct.

absP :: Parser (CommentSection -> Exp)
absP = do
    kwc <- funP
    (x, xc) <- identifierP
    (e, ec) <- expP
    pure $ \lp -> Abs (AbsComments lp kwc xc ec) x e

appP :: Parser (CommentSection -> Exp)
appP = do
    (f, fc) <- expP
    (e, ec) <- expP
    pure $ \lp -> App (AppComments lp fc ec) f e

We then use the withParensP helper to handle parsing the parentheses, providing the comment trailing the left parenthesis and producing the comment trailing the right parenthesis.

parensP :: Parser a -> Parser (CommentSection, a, CommentSection)
parensP p = (,,) <$> lparenP <*> p <*> rparenP

withParensP :: Parser (CommentSection -> b) -> Parser (b, CommentSection)
withParensP p = do
    (lc, f, rc) <- parensP p
    pure $ (f lc, rc)

With that, we can assemble our expression parser and program parser.

expP :: Parser (Exp, CommentSection)
expP =
    asum
        [ withParensP (asum [absP, appP])
        , varP
        , litP
        ]

programP :: Parser Program
programP = do
    P.skipMany P.hspace1
    leadingC <- commentBlocksP
    (e, trailingC) <- expP
    pure $ Program leadingC e trailingC

Pretty-Printing with Comments

Now that we can parse programs while preserving comment information, we can move on to rendering the syntax in our desired format. We will use prettyprint Wadler-Leijen pretty printing library to handle the actual layout.

First, we define how to pretty-print the comments. This is mostly straightforward, except that the final newline of a comment block has to be omitted. The nest function from the prettyprint library does not affect indentation until after the next newline (rather than affecting indentation of the first character after the newline). Therefore, if we want the indentation of anything following a comment to differ from the indentation of the comment, we need to be able to emit the newline after the close of the indentation group.

instance Pretty Comment where
    pretty c = (";" <>) . pretty . Text.stripEnd $ commentContent c
    prettyList cs =
        if null cs
            then mempty
            else mconcat $ intersperse PP.hardline (map pretty cs)

prettyCommentBlocks :: [NonEmpty Comment] -> Doc a
prettyCommentBlocks =
    mconcat . PP.punctuate (PP.hardline <> PP.hardline) . map pretty

instance Pretty CommentSection where
    pretty (CommentSection sl cs) =
        mconcat
            [ maybe mempty ((PP.space <>) . pretty) sl
            , if null cs then mempty else PP.hardline
            , prettyCommentBlocks cs
            ]

Despite preserving this ability, in our choice of pretty-printing below, we don’t end up needing it, and so we define the following helpers that we can use. However, if you want to have the closing parenthesis of an expression at a different indentation level than the indentation level of the previous part of the expression, then you will need to render the comment blocks and the final newline separately.

isEmpty :: CommentSection -> Bool
isEmpty (CommentSection sl cs) = isNothing sl && null cs

prettyCommentSectionOrElse :: Doc ann -> CommentSection -> Doc ann
prettyCommentSectionOrElse alt cb =
    mconcat
        [ pretty cb
        , if isEmpty cb then alt else PP.hardline
        ]

prettyCommentSection :: CommentSection -> Doc ann
prettyCommentSection = prettyCommentSectionOrElse mempty

hasComment :: Exp -> Bool
hasComment e = case e of
    Lit _ -> False
    Var _ -> False
    Abs (AbsComments lpc kwc vc bc) _ b ->
        or
            [ not (isEmpty lpc)
            , not (isEmpty kwc)
            , not (isEmpty vc)
            , not (isEmpty bc)
            , hasComment b
            ]
    App (AppComments lpc fc xc) f x ->
        or
            [ not (isEmpty lpc)
            , not (isEmpty fc)
            , not (isEmpty xc)
            , hasComment f
            , hasComment x
            ]

With all of this, the actual rendering is straightforward. Most of the implementation is about choices of how the rendered code should look:

What should the indentation level be? For this, I’ve chosen values that I think look nice.
Should a space be turned into newline (i.e., should we use space or softine) if there isn’t enough room? We don’t in some cases because it would not move the expression any further to the left.

…and so on.

instance Pretty Exp where
    pretty e = PP.group $ case e of
        Lit n -> pretty n
        Var v -> pretty v
        Abs (AbsComments lpc kwc vc bc) v b ->
            let vbBreak =
                    if not (isEmpty kwc) || hasComment b
                        then PP.hardline
                        else PP.softline
             in mconcat $
                [ PP.nest 1 . mconcat $
                    [ "("
                    , prettyCommentSection lpc
                    , "fun"
                    ]
                , PP.nest 5 . mconcat $
                    [ prettyCommentSectionOrElse PP.space kwc
                    , pretty v
                    ]
                , PP.nest 3 . mconcat $
                    [ prettyCommentSectionOrElse vbBreak vc
                    , pretty b
                    , prettyCommentSection bc
                    , ")"
                    ]
                ]
        App (AppComments lpc fc xc) f x ->
            let fxBreak =
                   if hasComment f || hasComment x
                       then PP.hardline
                       else PP.softline
             in mconcat $
                [ "("
                , PP.align . mconcat $
                    [ prettyCommentSection lpc
                    , pretty f
                    , prettyCommentSectionOrElse fxBreak fc
                    , pretty x
                    , prettyCommentSection xc
                    , ")"
                    ]
                ]

instance Pretty Program where
    pretty (Program leadingC e trailingC) =
        mconcat
            [ prettyCommentBlocks leadingC
            , if null leadingC then mempty else PP.hardline
            , pretty e
            , prettyCommentSectionOrElse PP.hardline trailingC
            ]

We are subject to the usual limitations of Wadler-Leijen pretty-printers in terms of where the layout algorithm will decide to insert line breaks. For example, when rendering our original example at 40 columns wide, the break is not in the ideal location:

; Definition of the Y combinator
(fun f ; function to make recursive
   ; remember, 'fun' is 'lambda'
   ((fun x (f (x x))) (fun x (f (x
                                 x)))))

Depending on the desired output for your pretty-printer, you might need to do additional wrangling of prettyprint, use a different rendering algorithm (such as a Bernardy pretty-printer) or use a custom algorithm.