Class 21: Structured documents

Methodology of Scientific Research

Andrés Aravena, PhD

May 14, 2024

Structured documents

You probably know that using a good data structure can dramatically improve an algorithm

And you use structured programs

The same applies to structuring our documents

Maybe you have used LaTeX, or Markdown

Maybe you know HTML

Separation of concerns

Separate style from structure

Describe the role of text, not the “looks”

The key idea is to describe what things are, not how they look

This part is based on the ideas discussed in “LaTeX: A Document Preparation System” by Leslie Lamport (1986).
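
A minimal sketch of this idea in Python, not taken from Lamport's book. The document is stored as a list of (role, text) pairs, and two style functions decide how each role looks. The names `DOCUMENT`, `render_plain` and `render_html` are assumptions made only for this illustration, and the HTML version does not escape special characters.

```python
# Structure: each piece of text is tagged with its role, not its looks.
DOCUMENT = [
    ("title", "Ten Simple Rules for Online Learning"),
    ("section", "Rule 1: Make a Plan"),
    ("paragraph", "There are many possible motivations for becoming involved in online learning…"),
]


def render_plain(document):
    """One possible style: typewriter-like plain text."""
    lines = []
    for role, text in document:
        if role == "title":
            lines.append(text.upper())
        elif role == "section":
            lines.append(text)
            lines.append("-" * len(text))  # underline the section heading
        else:
            lines.append(text)
    return "\n".join(lines)


def render_html(document):
    """Another style for the same structure: HTML tags."""
    tags = {"title": "h1", "section": "h2", "paragraph": "p"}
    return "\n".join(f"<{tags[role]}>{text}</{tags[role]}>" for role, text in document)


print(render_plain(DOCUMENT))
print(render_html(DOCUMENT))
```

Changing the look means changing a render function. The document itself stays the same.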

It is like a house

Structure makes the house solid and comfortable

If you only do decoration, the house looks nice but it is not solid

The structure of the walls comes first

Painting the walls in a nice color is secondary

Structural elements

  • Sections, subsections, paragraphs
  • Figures and Tables
  • Lists
  • References
  • Equations
  • Metadata
    • Title
    • Authors
    • Affiliations
    • Dates: submission, acceptance

Microsoft Word

The first tool most of us learn today is a WYSIWYG word processor

In word processors like Word®,
What You See Is What You Get

This is sometimes called WYSIWYG

It is easy to change fonts, sizes, colors and other visual attributes, without paying attention to structure

Style is not Structure

Even in Word, you can follow the same philosophy:

  • Separate style from structure

  • Focus on content

Structured Word documents

Now the document has structure

Structure without style

Historical note

The first commercially successful mechanical typewriter went on sale in 1874

They had only one font

We still use the same keyboard

Using UPPERCASE and underline for emphasis

Early computers had only text, no graphics

Giving style to plain text

Since there was only one type of letter, people used some symbols as “magic”

For example \ or @

If you write a “magic” symbol, you tell the computer that the next symbol shows a change of format

This is called a markup language

TeX

An important system for preparing documents on a computer was created in the late 1970s by Donald Knuth, who is probably the most important computer scientist of the last 70 years.

Donald Knuth won the Turing Award in 1974

Knuth invented TeX to typeset his book series, The Art of Computer Programming

LaTeX

TeX has styles but not structure. In the 80’s Leslie Lamport created LaTeX as an extension of TeX

Leslie Lamport won the Turing Award in 2013

Example: writing in LaTeX

A LaTeX document looks like this

\documentclass[a4paper]{article}
\title{Ten Simple Rules for Online Learning}
\author{David B. Searls}
\date{13 September 2012}
\begin{document}
\maketitle
\section{Rule 1: Make a Plan}
There are many possible motivations for becoming involved in online learning…
\end{document}

LaTeX files are text files. They will never be obsolete.

Changing the documentclass changes the look of the document

Advantages of LaTeX

  • It is free

  • It forces you to think logically and organize your ideas

  • Write first, compile later

  • Do not waste time playing with fonts

  • Good journals accept LaTeX submissions
    (they also accept Microsoft Word format)

LaTeX files are text files

  • Independent of any provider

  • Use your favorite text editor (VSCode?)

  • Version control friendly (GitHub?)

  • Can probably still be read 20 years from now

We cannot say the same about Microsoft Word

The real advantage: it looks correct

According to the author of LaTeX

The main mistake that people should stop making is

Worrying too much about formatting and not enough about content.

“How (La)TeX changed the face of Mathematics”. An E-interview with Leslie Lamport. http://lamport.azurewebsites.net/pubs/lamport-latex-interview.pdf

Bonus: Slides for presentations

After writing your paper, you will probably present it
(or maybe before finishing it)

Using structured documents makes it easy to recycle your material into presentation slides

In LaTeX you can do that using the beamer document class

Good ideas in LaTeX

  • Chapters, sections, subsections
  • Automatic creation of Table of Contents
  • Automatic numbering of sections, figures, tables
  • Cross referencing sections, figures, tables
  • Floating figures
  • Math formulas
  • Bibliographic references

Writing Math Expressions

LaTeX is favored by people who write mathematical formulas

\[(a+b)^n=\sum_{k=0}^n \frac{n!}{k!(n-k)!} a^k b^{n-k}\]

You can use this syntax in Microsoft Word’s Equation Editor

Learning how to write math is a good investment
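
The binomial formula above can also be checked numerically. This is a small sketch in Python, with a hypothetical helper name `binomial_expansion`. It computes the right-hand side of the formula and compares it with the left-hand side for a few integer values.

```python
from math import comb  # comb(n, k) = n! / (k! (n-k)!)


def binomial_expansion(a, b, n):
    """Right-hand side of the formula: sum over k of C(n, k) a^k b^(n-k)."""
    return sum(comb(n, k) * a**k * b ** (n - k) for k in range(n + 1))


# Compare both sides of the identity for a few integer values.
for a, b, n in [(1, 1, 4), (2, 3, 5), (-1, 4, 3)]:
    assert (a + b) ** n == binomial_expansion(a, b, n)

print(binomial_expansion(2, 3, 5))  # (2 + 3)^5 = 3125
```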

Bibliographic References

There are hundreds of citation styles

Life is too short to sort references manually

LaTeX also provides a convenient way to handle references

References are stored in a separate text file, in BibTeX format

Many tools can create BibTeX files for you

  • Zotero
  • Mendeley
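
To give a rough idea of what such a file contains, here is a sketch in Python that writes one BibTeX entry as plain text. The helper `bibtex_entry` and the citation key `searls2012` are hypothetical. The entry uses only the author, title and year shown in the LaTeX example, so it is deliberately incomplete (no journal, volume or pages).

```python
def bibtex_entry(key, entry_type, **fields):
    """Format one BibTeX entry as plain text."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


entry = bibtex_entry(
    "searls2012",  # hypothetical citation key
    "article",
    author="Searls, David B.",
    title="Ten Simple Rules for Online Learning",
    year="2012",
)
print(entry)
```

In the LaTeX file, this work would then be cited by its key, as in \cite{searls2012}. The citation style is chosen separately.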

LaTeX disadvantages

  • LaTeX is hard to learn
    • This discourages many people
    • Your collaborators may not use it
    • You need to have the Reference Manual at hand
  • It is oriented to producing printed material
    • It produces PDF files or equivalents
    • Not suitable for Web or eBook
  • Writing tables is hard

Web pages

In the 90’s most computers had good graphic capabilities and Internet access

Researchers at CERN invented the web, using “hyper-text”

(That is, text with links to other text)

Web pages are written in HyperText Markup Language (HTML)

HTML

Web pages are also text files. They look like this:

<!DOCTYPE html>
<html>
<head>
<title>Ten Simple Rules for Online Learning</title>
</head>
<body>
<h1>Rule 1: Make a Plan</h1>
<p>There are many possible motivations for becoming involved in online learning…</p>
</body>
</html>

Good ideas from HTML

  • Works well on the screen: adapts to screen size

  • Links to other pages

  • Structural elements

    • <h1>…</h1> marks Header level 1
    • There are also <h2> to <h6>
  • Comments: <!-- this part is not shown -->

  • Structure separated from Style

    • Style is defined in CSS files

Disadvantages of HTML

  • It does not work well for paper

  • It is hard to write manually

  • There are editors, but they often focus on style, not structure

Alternative: Markdown

Markdown is a lightweight markup language that can be easily converted into nicely formatted documents and presentations

---
title: Ten Simple Rules for Online Learning
author: David B. Searls
date: 13 September 2012
...

# Rule 1: Make a Plan

There are many possible motivations for becoming involved in online learning…

Text documents are good

Text files are for humans and computers

  • Binary files are hard to read
    • unless you have the correct program
  • Text files can be read by humans
    • Each byte (or short group of bytes) encodes a character
  • Text files can be read by computers
    • Data must be recyclable
    • The output of one program may be the input of another program
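
Because a Markdown file is plain text, a few lines of code are enough to reuse it as data. The following Python sketch extracts the outline of a document from its headings. The function name `outline` and the sample text are assumptions for illustration, and the sketch ignores special cases such as `#` lines inside code blocks.

```python
SAMPLE = """\
# Rule 1: Make a Plan

There are many possible motivations for becoming involved in online learning…

## Header 2
"""


def outline(markdown_text):
    """Return (level, title) for every line that starts with 1 to 6 '#' marks."""
    headings = []
    for line in markdown_text.splitlines():
        stripped = line.lstrip("#")
        level = len(line) - len(stripped)  # number of leading '#' marks
        if 1 <= level <= 6 and stripped.startswith(" "):
            headings.append((level, stripped.strip()))
    return headings


# Print the outline, indenting each title according to its level.
for level, title in outline(SAMPLE):
    print("  " * (level - 1) + title)
```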

Text editors instead of Word processors

The easiest way to handle text files is to use a text editor

These are programs to view and edit text files

They use a monospaced font, like Courier

Each letter has the same width

Text editors have syntax coloring

Since every letter has the same size, text editors use color to show structure

The color depends on the role of each piece of text

For example, headings can be shown in red

The color is not stored in the file. The editor adds it on screen

Text editors handling Markdown

These work with Markdown and other formats

All are good. We use VSCode

Online Markdown editors

Text files are forever

Free

  • nothing to pay

  • you can do whatever you want

They never become obsolete

But they do not have structure

Structured Documents

We want to identify the meaning, not the shapes

  • Title
  • Sections
    • Subsections
      • Lists
      • Figures
      • Tables
  • References to other works

Separation of concerns

The key idea is to describe what things are, not how they look

Describe the role of text, not the “looks”

Separate style from structure

Text files with structure

There are several markup languages that encode the structure of a text document

  • LaTeX
  • reStructuredText
  • MediaWiki
  • HTML
  • Markdown
  • Textile
  • AsciiDoc

Markdown

Markdown is a widely used markup language

  • Same philosophy as LaTeX, but simpler

  • The text file can be read and understood easily

  • It can be transformed into other formats

    • PDF, Word, Webpage (HTML)
  • Used in R, Python, Julia (Jupyter), in GitHub, and many other modern platforms

Markdown’s author, John Gruber, says:

“The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible.

“The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.”

Flavors of Markdown

Compiling means transforming Markdown into another format

There are many different Markdown compilers

Many people make their own compiler, and they expand the original idea

Unfortunately, they are not always 100% compatible

There is not yet an official standard

Recommendation: pandoc

(if you have RStudio, you have Pandoc)

Pandoc

If you need to convert files from one markup format into another, pandoc is your swiss-army knife

John MacFarlane, developer of Pandoc

Pandoc can convert between many formats, including

  • Markdown
  • Microsoft Word/PowerPoint
  • LaTeX
  • Jupyter notebook

John MacFarlane

Professor of Philosophy, University of California, Berkeley

Author of books

  • Philosophical Logic: A Contemporary Introduction
  • Assessment Sensitivity: Relative Truth and Its Applications

Recent papers:

  • “Lecture I: Vagueness and Communication”
  • “Lecture II: Seeing Through the Clouds”
  • “Lecture III: Indeterminacy as Indecision”
  • “On Probabilistic Knowledge”

Pandoc advantages

  • Text files

  • It is easy to write tables in Markdown

  • It is easy to write lists

  • Can be used for slides

    • Several web platforms (like this document)
    • Microsoft PowerPoint
  • Handles BibTeX references

Using Pandoc

It is a command-line program, and can be used inside VSCode

There is even a plugin

In the command line we write

pandoc document.md --output document.pdf

and there are many options. See https://pandoc.org
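
The same command can be launched from a script, which helps when many files must be converted. This is a minimal Python sketch. The function names `pandoc_command` and `convert` are hypothetical, and only the `--output` option shown above is used. Pandoc chooses the output format from the file extension. PDF output also needs a LaTeX installation, so the example call uses a Word file.

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build the command line shown above as a list of arguments."""
    return ["pandoc", source, "--output", target]


def convert(source, target):
    """Run pandoc if it is installed; return True when the conversion ran."""
    if shutil.which("pandoc") is None:
        return False  # pandoc was not found on the PATH
    subprocess.run(pandoc_command(source, target), check=True)
    return True


print(pandoc_command("document.md", "document.docx"))
```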

Markdown format

Paragraphs

  • Consecutive lines of text are one paragraph.
  • Paragraphs are separated by an empty line

The first paragraph.

Another paragraph

Headers

# Header 1
## Header 2
### Header 3
#### Header 4

Header 1

Header 2

Header 3

Header 4

Unordered Lists

+ Item 1
+ Item 2
    + Item 2a
    + Item 2b

  • Item 1
  • Item 2
    • Item 2a
    • Item 2b

Sub-lists are indented by 4 spaces

Ordered Lists

1. Item 1
1. Item 2
1. Item 3
    1. Item 3a
    1. Item 3b

  1. Item 1
  2. Item 2
  3. Item 3
    1. Item 3a
    2. Item 3b

Images

You have to indicate the web address of the image

![optional text](http://example.com/logo.png)

or the path of a file on your computer

![optional text](images/logo.png)

Optional text is shown when the image is not found

![optional text](images/logo.pn)

optional text

Figures with Captions

This is a pandoc extension, not standard Markdown

If the figure is a paragraph by itself (has empty lines before and after), then the optional text becomes the caption

![This is the caption of the figure.](images/logo.png)

Tables

There are several formats. The easiest one is this

|   | sample   | dose | time  | agent            |
|---|----------|------|-------|------------------|
| 1 | GSM91440 | low  | 5 min | caffeine         |
| 2 | GSM91893 | low  | 5 min | caffeine         |
| 3 | GSM91428 | low  | 5 min | calcofluor white |
| 4 | GSM91881 | low  | 5 min | calcofluor white |

Tables with captions (pandoc extension)

Write Table: followed by the caption, in its own paragraph just after the table

|   | sample   | dose | time  | agent            |
|---|----------|------|-------|------------------|
| 1 | GSM91440 | low  | 5 min | caffeine         |
| 2 | GSM91893 | low  | 5 min | caffeine         |
| 3 | GSM91428 | low  | 5 min | calcofluor white |
| 4 | GSM91881 | low  | 5 min | calcofluor white |

Table: This is the table caption

Making tables

There are some VSCode plug-ins that can make tables for you

Or you can make them in R using the knitr or pander packages

A good alternative is this website:

https://www.tablesgenerator.com/markdown_tables
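
A table like the one above can also be generated by a short program. This is a minimal sketch in Python, with a hypothetical function name `markdown_table`. It pads every column to the width of its longest cell, so the text file stays readable.

```python
def markdown_table(header, rows):
    """Return a pipe table with every column padded to the same width."""
    table = [header] + rows
    widths = [max(len(str(row[i])) for row in table) for i in range(len(header))]

    def format_row(row):
        cells = [str(cell).ljust(width) for cell, width in zip(row, widths)]
        return "| " + " | ".join(cells) + " |"

    # The separator line needs two extra dashes per column for the padding spaces.
    separator = "|" + "|".join("-" * (width + 2) for width in widths) + "|"
    return "\n".join([format_row(header), separator] + [format_row(r) for r in rows])


print(markdown_table(
    ["", "sample", "dose", "time", "agent"],
    [
        [1, "GSM91440", "low", "5 min", "caffeine"],
        [3, "GSM91428", "low", "5 min", "calcofluor white"],
    ],
))
```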

Computer code

Programs are usually written in a monospaced font.
That is, all letters have the same width.

```
is.computer <- function(code) {
    # comment
}
```

Nicer computer code

You can indicate the language, and get colors

```r
is.computer <- function(code) {
    # comment
}
```

Format inside a paragraph

Footnotes

Here is a footnote reference,[^1] and another.[^longnote]

[^1]: Here is the footnote.

[^longnote]: Here's one with multiple blocks.

    Subsequent paragraphs are indented to show that they
belong to the previous footnote.

This paragraph won't be part of the note, because it
isn't indented.

Here is a footnote reference,1 and another.2

This paragraph won’t be part of the note, because it isn’t indented.

Inline code

We can compare `x` and `data`

We can compare x and data

Emphasis

Use it only when strictly necessary

Inside the paragraph we can have *italics*
and **bold** text

Inside the paragraph we can have italics and bold text

Comments in Pandoc

Pandoc can understand some HTML

If we write an HTML comment, it will not show in the output

<!-- this part does not show -->

(To remove the comment from the output file as well, use the pandoc option --strip-comments)

Online resources

For your weekend

