word-document
Microsoft Word (.docx) 문서 생성, 읽기, 편집 — docx 라이브러리로 리포트/템플릿/편지 생성, Pandoc으로 텍스트 추출/변환
Word Document (.docx) Processing
Create, read, and manipulate Microsoft Word (.docx) documents.
When to Use
- Generate Word reports (weekly status, audit summaries) from structured data.
- Produce standardized memos/letters/templates with consistent formatting.
- Insert images, headers/footers, page numbers, page breaks, and table of contents.
- Extract text from existing
.docxfiles for indexing, review, or conversion to Markdown. - Create invoices, contracts, or agreements from templates.
Key Features
- Create
.docxdocuments using thedocx(docx-js) Node.js library. - Read/extract
.docxcontent via Pandoc conversion. - Explicit page setup (US Letter vs A4), margins, and landscape handling.
- Style control (default font, overriding built-in Heading styles for TOC compatibility).
- Proper list generation (bullets/numbering via numbering config).
- Robust table rendering (DXA widths, column widths + cell widths, shading, borders).
- Images with required metadata and transformations.
- Page breaks, headers/footers, and page numbering.
- Table of contents generation based on Word heading levels.
Dependencies
- Node.js: >= 18
- docx:
^9.0.0(npm i docx) - pandoc:
>= 2.0(CLI tool for extracting/converting.docxcontent)
Usage
1) Create a .docx (Node.js)
Install
npm i docx
create-doc.js
const fs = require("fs");
const {
Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell,
ImageRun, Header, Footer, AlignmentType, BorderStyle, WidthType,
ShadingType, PageNumber, PageBreak, TableOfContents, HeadingLevel,
LevelFormat,
} = require("docx");const US_LETTER = { width: 12240, height: 15840 }; // DXA (1440 = 1 inch)
const MARGINS_1IN = { top: 1440, right: 1440, bottom: 1440, left: 1440 };
const CONTENT_WIDTH = US_LETTER.width - MARGINS_1IN.left - MARGINS_1IN.right; // 9360
const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" };
const borders = { top: border, bottom: border, left: border, right: border };
const doc = new Document({
styles: {
default: {
document: { run: { font: "Arial", size: 24 } }, // 12pt
},
paragraphStyles: [
{
id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 32, bold: true, font: "Arial" },
paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 },
},
{
id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true,
run: { size: 28, bold: true, font: "Arial" },
paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 },
},
],
},
numbering: {
config: [
{
reference: "bullets", levels: [{
level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } },
}],
},
{
reference: "numbers", levels: [{
level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT,
style: { paragraph: { indent: { left: 720, hanging: 360 } } },
}],
},
],
},
sections: [{
properties: {
page: { size: { width: US_LETTER.width, height: US_LETTER.height }, margin: MARGINS_1IN },
},
headers: {
default: new Header({ children: [new Paragraph({ children: [new TextRun("Sample Header")] })] }),
},
footers: {
default: new Footer({ children: [
new Paragraph({ children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })] }),
] }),
},
children: [
new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("DOCX Creation Example")] }),
new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" }),
new Paragraph({ heading: HeadingLevel.HEADING_2, children: [new TextRun("Lists")] }),
new Paragraph({ numbering: { reference: "bullets", level: 0 }, children: [new TextRun("Bullet item")] }),
new Paragraph({ numbering: { reference: "numbers", level: 0 }, children: [new TextRun("Numbered item")] }),
new Paragraph({ heading: HeadingLevel.HEADING_2, children: [new TextRun("Table")] }),
new Table({
width: { size: CONTENT_WIDTH, type: WidthType.DXA },
columnWidths: [CONTENT_WIDTH / 2, CONTENT_WIDTH / 2],
rows: [
new TableRow({ children: [
new TableCell({
borders, width: { size: CONTENT_WIDTH / 2, type: WidthType.DXA },
shading: { fill: "D5E8F0", type: ShadingType.CLEAR },
margins: { top: 80, bottom: 80, left: 120, right: 120 },
children: [new Paragraph("Left cell")],
}),
new TableCell({
borders, width: { size: CONTENT_WIDTH / 2, type: WidthType.DXA },
shading: { fill: "FFFFFF", type: ShadingType.CLEAR },
margins: { top: 80, bottom: 80, left: 120, right: 120 },
children: [new Paragraph("Right cell")],
}),
] }),
],
}),
new Paragraph({ children: [new PageBreak()] }),
new Paragraph({ heading: HeadingLevel.HEADING_2, children: [new TextRun("New Page Section")] }),
new Paragraph({ children: [new TextRun("This is on a new page.")] }),
],
}],
});
(async () => {
const buffer = await Packer.toBuffer(doc);
fs.writeFileSync("output.docx", buffer);
console.log("Wrote output.docx");
})();
Run
node create-doc.js
2) Read/extract .docx text with Pandoc
pandoc document.docx -o output.md
3) Python alternative (python-docx)
pip install python-docx
from docx import DocumentCreate
doc = Document()
doc.add_heading('Title', level=1)
doc.add_paragraph('Hello World')
doc.save('output.docx')Read
doc = Document('input.docx')
for para in doc.paragraphs:
print(para.text)
Implementation Details
- DOCX structure:
.docxis a ZIP archive containing XML parts. - Page sizing (DXA): 1440 DXA = 1 inch. US Letter: 12240 x 15840 DXA. A4: 11906 x 16838 DXA.
- Landscape: Provide portrait dimensions and set
orientation: PageOrientation.LANDSCAPE. - Headings and TOC: TOC requires
HeadingLevel.*. Override styles with exact IDs:"Heading1","Heading2". IncludeoutlineLevel. - Paragraphs: Do not embed
\nin runs — create separateParagraphelements. - Lists: Use
numbering.configwithLevelFormat.BULLET/LevelFormat.DECIMAL. Same reference continues numbering; different reference restarts. - Tables: Always use
WidthType.DXA. Set tablewidthandcolumnWidths(sum must equal table width). Set each cellwidthto match its column width. - Images:
ImageRunrequires validtypeandaltTextfields. - Page breaks:
PageBreakmust be inside aParagraph(or usepageBreakBefore: true).
Troubleshooting
| Problem | Fix |
|---------|-----|
| Cannot find module 'docx' | Run npm i docx in working directory |
| pandoc: command not found | Install: brew install pandoc |
| TOC empty in Word | Open doc in Word and press Ctrl+A then F9 to update fields |
| Table widths broken in Google Docs | Use WidthType.DXA instead of percentages |
| python-docx not found | pip install python-docx |