A Comment-Preserving Pretty-Printer
The goal is to implement a comment-preserving autoformatter for a lambda
calculus language using s-expressions. Without comments, the grammar of the
languages is (with v as a metavariable for identifiers and n as a
metavariable for literal integers):
v ∈ Vars
n ∈ Integers
e ::= v : variable
| n : literal integer
| ( fun v e ) : abstraction
| ( e e ) : application
In our little language comments begin with a semicolon (;) character and
extend to the end of the line. Comments can occur anywhere whitespace can occur.
For example, the following would be a valid program in our little language:
; Definition of the Y combinator
(fun f ; function to make recursive
; remember, 'fun' is 'lambda'
((fun x
(f (x
x)
))
(fun x (f (x x)))))
After pretty-printing (with a target of 25 characters per line), it will look like:
; Definition of the Y combinator
(fun f ; function to make recursive
; remember, 'fun' is 'lambda'
((fun x (f (x x)))
(fun x (f (x x)))))
There are two steps involved in auto-formatting the language:
- parsing in a way that preserves comment information, and
- rendering the syntax tree.
First, we have our language extensions and imports. We’ll be using the following libraries in this implementation:
- text,
- megaparsec,
- parser-combinators,
- parsers, and
- prettyprinter.
{-# LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}
module PrettyPrintWithComments where
import Control.Applicative (many)
import Data.Foldable (asum)
import Data.Functor (void)
import Data.List (intersperse)
import Data.List.NonEmpty (NonEmpty, some1)
import Data.Maybe (catMaybes, isNothing)
import Data.Text (Text)
import Data.Text qualified as Text
import Data.Void (Void)
import Prettyprinter (Doc, Pretty (pretty))
import Prettyprinter qualified as PP
import Text.Megaparsec qualified as P
import Text.Megaparsec.Char qualified as P
import Text.Megaparsec.Char.Lexer qualified as Lexer
import Text.Parser.Combinators qualified as P (skipOptional)Preserving Comments While Parsing
What information about the comments we need to put the syntax tree depends on what about the comments we want to preserve while formatting the code. In this example, we want to keep comments on the same line as some code on that line, and we want to maintain separation between blocks of comments that are on their own lines and separated by empty lines.
newtype Comment = Comment
{ -- | The content of the comment, not including the leading comment
-- character
commentContent :: Text
}
deriving (Show)
data CommentSection = CommentSection
{ -- | The comment on the same line as the lexeme, if there is one
commentSameLine :: Maybe Comment
, -- | Blocks of comments following lexemes
commentBlocks :: [NonEmpty Comment]
}
deriving (Show)If we wanted to preserve user-inserted line breaks or otherwise use them as hints for formatting, we would also need to store where the line breaks occur.
We’ll be using a parser combinator library. The typical way to use a parser combinator is to first define the combinators for dealing with lexemes, and then use those to define the parsers for syntactic elements. Whitespace is handled by having lexeme parsers consume and discard all trailing whitespace.
Because we want to preserve comment information, instead of having the whitespace parser discard all of its information, we will have it preserve the comment information that we are interested in.
When we parse a single line of a comment, we will consume the trailing whitespace from the concluding newline through either the end of the next line (not including the newline) or the first non-whitespace character, whichever comes first.
type Parser = P.Parsec Void Text
commentP :: Parser Comment
commentP = do
void $ P.char ';'
content <- P.takeWhileP (Just "character") (/= '\n')
P.skipOptional P.newline <* P.skipMany P.hspace1
pure $ Comment contentA block of comments is just one or more single comment lines.
commentBlockP :: Parser (NonEmpty Comment)
commentBlockP = some1 commentPAnd comment blocks can be separated by lines of only whitespace.
commentBlocksP :: Parser [NonEmpty Comment]
commentBlocksP = fmap catMaybes . many . P.choice $
[ Just <$> commentBlockP
, Nothing <$ P.space1
]Given these definitions, we make a new definition of space that serves the
same purpose as the one from Megaparse.Char.Lexer, except that it also
produces the comments it encountered while consuming whitespace.
space :: Parser CommentSection
space = P.hidden $ do
P.skipMany P.hspace1
sameLineC <- P.optional commentP
cs <- commentBlocksP
pure $ CommentSection sameLineC csWe define a similar variants of lexeme and symbol. Since token parsers
consume trailing whitespace, parsing a token produces the comments included in
that trailing whitespace.
lexeme :: Parser a -> Parser (a, CommentSection)
lexeme p = do
x <- p
cs <- space
pure (x, cs)
symbol :: Text -> Parser CommentSection
symbol = fmap snd . lexeme . P.stringWith that, we define parsers for the tokens in our language.
lparenP :: Parser CommentSection
lparenP = symbol "("
rparenP :: Parser CommentSection
rparenP = symbol ")"
funP :: Parser CommentSection
funP = symbol "fun"
dotP :: Parser CommentSection
dotP = symbol "fun"
colonP :: Parser CommentSection
colonP = symbol "::"
identifierP :: Parser (Text, CommentSection)
identifierP = lexeme (fst <$> P.match (P.letterChar *> many P.alphaNumChar))
literalP :: Parser (Integer, CommentSection)
literalP = lexeme Lexer.decimalThe definition of the AST is standard, except for one field on each of the Abs
and App constructors which are used to capture information about the comments
that appear within those syntactic constructs.
Despite including the comments that follow a lexical token with the parsing of
that token, when representing the AST it is easier to work only with the
comments contained within the syntactic element (in part due to how the
prettyprinter library handles indentation and newlines, which we’ll explain
later).
The inclusion only of the comments that occur within the syntactic elements is
why the Lit and Var constructors have no comment field.
For AbsComments and AppComments, the comment fields is named after the
preceding syntactic element. Because only the comments contained within the
overall syntactic construct are included, there is an lparenComments field,
but no rparenComments field. The fields that would trail the right parenthesis
are instead included in the containing construct.
type Var = Text
data Exp
= Lit Integer
| Var Var
| Abs AbsComments Var Exp
| App AppComments Exp Exp
deriving (Show)
data AbsComments = AbsComments
{ lparenComments :: CommentSection
, keywordComments :: CommentSection
, boundNameComments :: CommentSection
, bodyComments :: CommentSection
}
deriving (Show)
data AppComments = AppComments
{ lparenComments :: CommentSection
, functionComments :: CommentSection
, argumentComments :: CommentSection
}
deriving (Show)The overall program starts at the beginning of the input and goes to the end of
the input, and so includes all leading and trailing comments outside of the
expressions being parsed. Leading comments can’t occur on the same line as some
preceding code, because there is no preceding code, so the leading comments are
just a list of Comment blocks instead of a CommentSection.
data Program = Program [NonEmpty Comment] Exp CommentSection
deriving (Show)As with the parsers for the lexical tokens, the parsers for the syntactic elements produce both the parsed value and the trailing comment. With the literal and variable parsers, this is straightforward: the parsers just wrap the values with the appropriate constructors.
litP :: Parser (Exp, CommentSection)
litP = do
(n, c) <- literalP
pure (Lit n, c)
varP :: Parser (Exp, CommentSection)
varP = do
(v, c) <- identifierP
pure $ (Var v, c)For abstraction and application, we first define the parsers for the content within the parentheses. Because we aren’t parsing the parentheses here, we don’t have the comment following the left parenthesis and so must take it as an argument. We also don’t need to return an extra comment, because the comment trailing the last token parsed is still within the construct.
absP :: Parser (CommentSection -> Exp)
absP = do
kwc <- funP
(x, xc) <- identifierP
(e, ec) <- expP
pure $ \lp -> Abs (AbsComments lp kwc xc ec) x e
appP :: Parser (CommentSection -> Exp)
appP = do
(f, fc) <- expP
(e, ec) <- expP
pure $ \lp -> App (AppComments lp fc ec) f eWe then use the withParensP helper to handle parsing the parentheses,
providing the comment trailing the left parenthesis and producing the comment
trailing the right parenthesis.
parensP :: Parser a -> Parser (CommentSection, a, CommentSection)
parensP p = (,,) <$> lparenP <*> p <*> rparenP
withParensP :: Parser (CommentSection -> b) -> Parser (b, CommentSection)
withParensP p = do
(lc, f, rc) <- parensP p
pure $ (f lc, rc)With that, we can assemble our expression parser and program parser.
expP :: Parser (Exp, CommentSection)
expP =
asum
[ withParensP (asum [absP, appP])
, varP
, litP
]
programP :: Parser Program
programP = do
P.skipMany P.hspace1
leadingC <- commentBlocksP
(e, trailingC) <- expP
pure $ Program leadingC e trailingCPretty-Printing with Comments
Now that we can parse programs while preserving comment information, we can move
on to rendering the syntax in our desired format. We will use prettyprint
Wadler-Leijen pretty printing library to handle the actual layout.
First, we define how to pretty-print the comments. This is mostly straightforward,
except that the final newline of a comment block has to be omitted. The nest
function from the prettyprint library does not affect indentation until after
the next newline (rather than affecting indentation of the first character after
the newline). Therefore, if we want the indentation of anything following a
comment to differ from the indentation of the comment, we need to be able to
emit the newline after the close of the indentation group.
instance Pretty Comment where
pretty c = (";" <>) . pretty . Text.stripEnd $ commentContent c
prettyList cs =
if null cs
then mempty
else mconcat $ intersperse PP.hardline (map pretty cs)
prettyCommentBlocks :: [NonEmpty Comment] -> Doc a
prettyCommentBlocks =
mconcat . PP.punctuate (PP.hardline <> PP.hardline) . map pretty
instance Pretty CommentSection where
pretty (CommentSection sl cs) =
mconcat
[ maybe mempty ((PP.space <>) . pretty) sl
, if null cs then mempty else PP.hardline
, prettyCommentBlocks cs
]Despite preserving this ability, in our choice of pretty-printing below, we don’t end up needing it, and so we define the following helpers that we can use. However, if you want to have the closing parenthesis of an expression at a different indentation level than the indentation level of the previous part of the expression, then you will need to render the comment blocks and the final newline separately.
isEmpty :: CommentSection -> Bool
isEmpty (CommentSection sl cs) = isNothing sl && null cs
prettyCommentSectionOrElse :: Doc ann -> CommentSection -> Doc ann
prettyCommentSectionOrElse alt cb =
mconcat
[ pretty cb
, if isEmpty cb then alt else PP.hardline
]
prettyCommentSection :: CommentSection -> Doc ann
prettyCommentSection = prettyCommentSectionOrElse mempty
hasComment :: Exp -> Bool
hasComment e = case e of
Lit _ -> False
Var _ -> False
Abs (AbsComments lpc kwc vc bc) _ b ->
or
[ not (isEmpty lpc)
, not (isEmpty kwc)
, not (isEmpty vc)
, not (isEmpty bc)
, hasComment b
]
App (AppComments lpc fc xc) f x ->
or
[ not (isEmpty lpc)
, not (isEmpty fc)
, not (isEmpty xc)
, hasComment f
, hasComment x
]With all of this, the actual rendering is straightforward. Most of the implementation is about choices of how the rendered code should look:
- What should the indentation level be? For this, I’ve chosen values that I think look nice.
- Should a space be turned into newline (i.e., should we use
spaceorsoftine) if there isn’t enough room? We don’t in some cases because it would not move the expression any further to the left.
…and so on.
instance Pretty Exp where
pretty e = PP.group $ case e of
Lit n -> pretty n
Var v -> pretty v
Abs (AbsComments lpc kwc vc bc) v b ->
let vbBreak =
if not (isEmpty kwc) || hasComment b
then PP.hardline
else PP.softline
in mconcat $
[ PP.nest 1 . mconcat $
[ "("
, prettyCommentSection lpc
, "fun"
]
, PP.nest 5 . mconcat $
[ prettyCommentSectionOrElse PP.space kwc
, pretty v
]
, PP.nest 3 . mconcat $
[ prettyCommentSectionOrElse vbBreak vc
, pretty b
, prettyCommentSection bc
, ")"
]
]
App (AppComments lpc fc xc) f x ->
let fxBreak =
if hasComment f || hasComment x
then PP.hardline
else PP.softline
in mconcat $
[ "("
, PP.align . mconcat $
[ prettyCommentSection lpc
, pretty f
, prettyCommentSectionOrElse fxBreak fc
, pretty x
, prettyCommentSection xc
, ")"
]
]
instance Pretty Program where
pretty (Program leadingC e trailingC) =
mconcat
[ prettyCommentBlocks leadingC
, if null leadingC then mempty else PP.hardline
, pretty e
, prettyCommentSectionOrElse PP.hardline trailingC
]We are subject to the usual limitations of Wadler-Leijen pretty-printers in terms of where the layout algorithm will decide to insert line breaks. For example, when rendering our original example at 40 columns wide, the break is not in the ideal location:
; Definition of the Y combinator
(fun f ; function to make recursive
; remember, 'fun' is 'lambda'
((fun x (f (x x))) (fun x (f (x
x)))))
Depending on the desired output for your pretty-printer, you might need to do
additional wrangling of prettyprint, use a different rendering algorithm (such
as a Bernardy
pretty-printer)
or use a custom algorithm.