Quickly convert (.rtf | .doc) files to Markdown syntax with PHP

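I've been manually converting articles to Markdown syntax for a few days now, and it's getting rather tedious. Some of them are 3 or 4 pages long, with italics and other emphasized text throughout. Is there a faster way to convert (.rtf | .doc) files to clean Markdown syntax that I can take advantage of?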

If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to HTML, and pandoc does a good job of converting the resulting HTML to Markdown:

    $ textutil -convert html file.doc -stdout | pandoc -f html -t markdown -o file.md
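
For more than a handful of files, the same pipeline can be driven from a small script. The following is only a sketch, assuming macOS with both textutil and pandoc on the PATH; the function names build_commands and convert are made up for illustration, and the flags are the ones from the one-liner above.

```python
import subprocess
from pathlib import Path


def build_commands(src):
    """Return the textutil and pandoc argument lists for one input file."""
    src = Path(src)
    textutil_cmd = ["textutil", "-convert", "html", str(src), "-stdout"]
    # Write the Markdown next to the input, e.g. file.doc -> file.md.
    pandoc_cmd = ["pandoc", "-f", "html", "-t", "markdown",
                  "-o", str(src.with_suffix(".md"))]
    return textutil_cmd, pandoc_cmd


def convert(src):
    """Pipe textutil's HTML output into pandoc for a single file."""
    textutil_cmd, pandoc_cmd = build_commands(src)
    html = subprocess.run(textutil_cmd, check=True, capture_output=True).stdout
    subprocess.run(pandoc_cmd, input=html, check=True)
```

Calling convert("file.doc") should then behave like the shell pipeline and write file.md; looping over a directory of .doc, .docx, and .rtf files is a short step from there.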

I have a script that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to Markdown.

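ProgTips has a possible solution in the form of a Word macro (source download):

A simple macro (source download) that automatically converts the most trivial things. The macro does the following:

  • Replaces bold and italics
  • Replaces headings (Heading 1-6)
  • Replaces numbered and bulleted lists

It is very buggy, and I believe it hangs on larger documents, but I'm not claiming it is a stable release anyway! :-) It is for experimental use only; recode and reuse it as you like, and post a comment if you find a better solution.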

Source: ProgTips
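
To make the three replacements concrete, here is a rough Python sketch of the same text transformations. It is not the macro itself and does not touch Word; a "run" here is just a hypothetical (text, bold, italic) tuple standing in for a stretch of uniformly formatted text, and both function names are invented for this example.

```python
def run_to_markdown(text, bold=False, italic=False):
    """Wrap italic text in _..._ and bold text in **...**."""
    if italic:
        text = f"_{text}_"
    if bold:
        text = f"**{text}**"
    return text


def paragraph_to_markdown(runs, heading_level=0, list_marker=""):
    """Render one paragraph from a list of (text, bold, italic) tuples.

    heading_level: 0 for body text, 1-6 for Heading 1-6.
    list_marker: "" for plain paragraphs, "*" for bullets, "1." and so on
    for numbered items.
    """
    body = "".join(run_to_markdown(*run) for run in runs)
    if heading_level:
        return "#" * heading_level + " " + body
    if list_marker:
        return f"{list_marker} {body}"
    return body
```

For instance, a Heading 2 paragraph containing the single run ("Title", False, False) would come out as "## Title".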

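Macro source

Installation

  • Open WinWord
  • Press Alt + F11 to open the VBA editor
  • Right-click the first project in the project browser
  • Choose Insert -> Module
  • Paste in the code from the file
  • Close the macro editor
  • Go to Tools > Macro > Macros and run the macro named MarkDown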

Source: ProgTips

Source

The macro source, for safekeeping in case ProgTips deletes the post or the site goes away:

    '*** A simple MsWord->Markdown replacement macro by Kriss Rauhvargers, 2006.02.02.
    '*** This tool does NOT implement all the markup specified in MarkDown definition by John Gruber, only
    '*** the most simple things. These are:
    '*** 1) Replaces all non-list paragraphs to ^p paragraph so MarkDown knows it is a stand-alone paragraph
    '*** 2) Converts tables to text. In fact, tables get lost.
    '*** 3) Adds a single indent to all indented paragraphs
    '*** 4) Replaces all the text in italics to _text_
    '*** 5) Replaces all the text in bold to **text**
    '*** 6) Replaces Heading1-6 to #..#Heading (Heading numbering gets lost)
    '*** 7) Replaces bulleted lists with ^p * listitem ^p* listitem2...
    '*** 8) Replaces numbered lists with ^p 1. listitem ^p2. listitem2...
    '*** Feel free to use and redistribute this code
    Sub MarkDown()
        Dim bReplace As Boolean
        Dim i As Integer
        Dim oPara As Paragraph

        'remove formatting from paragraph sign so that we dont get **blablabla^p** but rather **blablabla**^p
        Call RemoveBoldEnters

        For i = Selection.Document.Tables.Count To 1 Step -1
            Call Selection.Document.Tables(i).ConvertToText
        Next

        'simple text indent + extra paragraphs for non-numbered paragraphs
        For i = Selection.Document.Paragraphs.Count To 1 Step -1
            Set oPara = Selection.Document.Paragraphs(i)
            If oPara.Range.ListFormat.ListType = wdListNoNumbering Then
                If oPara.LeftIndent > 0 Then
                    oPara.Range.InsertBefore (">")
                End If
                oPara.Range.InsertBefore (vbCrLf)
            End If
        Next

        'italic -> _italic_
        Selection.HomeKey Unit:=wdStory
        bReplace = ReplaceOneItalic 'first replacement
        While bReplace 'other replacements
            bReplace = ReplaceOneItalic
        Wend

        'bold-> **bold**
        Selection.HomeKey Unit:=wdStory
        bReplace = ReplaceOneBold 'first replacement
        While bReplace
            bReplace = ReplaceOneBold 'other replacements
        Wend

        'Heading -> ##heading
        For i = 1 To 6 'heading1 to heading6
            Selection.HomeKey Unit:=wdStory
            bReplace = ReplaceH(i) 'first replacement
            While bReplace
                bReplace = ReplaceH(i) 'other replacements
            Wend
        Next

        Call ReplaceLists
        Selection.HomeKey Unit:=wdStory
    End Sub

    '***************************************************************
    ' Function to replace bold with **bold**, only the first occurrence
    ' Returns true if any occurrence found, false otherwise
    ' Originally recorded by WinWord macro recorder, probably contains
    ' quite a lot of useless code
    '***************************************************************
    Function ReplaceOneBold() As Boolean
        Dim bReturn As Boolean
        Selection.Find.ClearFormatting
        With Selection.Find
            .Text = ""
            .Forward = True
            .Wrap = wdFindContinue
            .Font.Bold = True
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        bReturn = False
        While Selection.Find.Execute = True
            bReturn = True
            Selection.Text = "**" & Selection.Text & "**"
            Selection.Font.Bold = False
            Selection.Find.Execute
        Wend
        ReplaceOneBold = bReturn
    End Function

    '*******************************************************************
    ' Function to replace italic with _italic_, only the first occurrence
    ' Returns true if any occurrence found, false otherwise
    ' Originally recorded by WinWord macro recorder, probably contains
    ' quite a lot of useless code
    '********************************************************************
    Function ReplaceOneItalic() As Boolean
        Dim bReturn As Boolean
        Selection.Find.ClearFormatting
        With Selection.Find
            .Text = ""
            .Forward = True
            .Wrap = wdFindContinue
            .Font.Italic = True
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        bReturn = False
        While Selection.Find.Execute = True
            bReturn = True
            Selection.Text = "_" & Selection.Text & "_"
            Selection.Font.Italic = False
            Selection.Find.Execute
        Wend
        ReplaceOneItalic = bReturn
    End Function

    '*********************************************************************
    ' Function to replace headingX with #heading, only the first occurrence
    ' Returns true if any occurrence found, false otherwise
    ' Originally recorded by WinWord macro recorder, probably contains
    ' quite a lot of useless code
    '*********************************************************************
    Function ReplaceH(ByVal ipNumber As Integer) As Boolean
        Dim sReplacement As String
        Dim bReturn As Boolean
        Select Case ipNumber
            Case 1: sReplacement = "#"
            Case 2: sReplacement = "##"
            Case 3: sReplacement = "###"
            Case 4: sReplacement = "####"
            Case 5: sReplacement = "#####"
            Case 6: sReplacement = "######"
        End Select
        Selection.Find.ClearFormatting
        Selection.Find.Style = ActiveDocument.Styles("Heading " & ipNumber)
        With Selection.Find
            .Text = ""
            .Replacement.Text = ""
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        bReturn = False
        While Selection.Find.Execute = True
            bReturn = True
            Selection.Range.InsertBefore (vbCrLf & sReplacement & " ")
            Selection.Style = ActiveDocument.Styles("Normal")
            Selection.Find.Execute
        Wend
        ReplaceH = bReturn
    End Function

    '***************************************************************
    ' A fix-up for paragraph marks that are bold or italic
    '***************************************************************
    Sub RemoveBoldEnters()
        Selection.HomeKey Unit:=wdStory
        Selection.Find.ClearFormatting
        Selection.Find.Font.Italic = True
        Selection.Find.Replacement.ClearFormatting
        Selection.Find.Replacement.Font.Bold = False
        Selection.Find.Replacement.Font.Italic = False
        With Selection.Find
            .Text = "^p"
            .Replacement.Text = "^p"
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
        End With
        Selection.Find.Execute Replace:=wdReplaceAll

        Selection.HomeKey Unit:=wdStory
        Selection.Find.ClearFormatting
        Selection.Find.Font.Bold = True
        Selection.Find.Replacement.ClearFormatting
        Selection.Find.Replacement.Font.Bold = False
        Selection.Find.Replacement.Font.Italic = False
        With Selection.Find
            .Text = "^p"
            .Replacement.Text = "^p"
            .Forward = True
            .Wrap = wdFindContinue
            .Format = True
        End With
        Selection.Find.Execute Replace:=wdReplaceAll
    End Sub

    '***************************************************************
    ' Replaces bulleted and numbered lists with Markdown list items,
    ' then removes the Word list formatting
    '***************************************************************
    Sub ReplaceLists()
        Dim i As Integer
        Dim j As Integer
        Dim Para As Paragraph

        Selection.HomeKey Unit:=wdStory

        'iterate through all the lists in the document
        For i = Selection.Document.Lists.Count To 1 Step -1
            'check each paragraph in the list
            For j = Selection.Document.Lists(i).ListParagraphs.Count To 1 Step -1
                Set Para = Selection.Document.Lists(i).ListParagraphs(j)
                'if it's a bulleted list
                If Para.Range.ListFormat.ListType = wdListBullet Then
                    Para.Range.InsertBefore (ListIndent(Para.Range.ListFormat.ListLevelNumber, "*"))
                'if it's a numbered list
                ElseIf Para.Range.ListFormat.ListType = wdListSimpleNumbering Or _
                       wdListMixedNumbering Or _
                       wdListListNumOnly Then
                    Para.Range.InsertBefore (Para.Range.ListFormat.ListValue & ". ")
                End If
            Next j
            'inserts paragraph marks before and after, removes the list itself
            Selection.Document.Lists(i).Range.InsertParagraphBefore
            Selection.Document.Lists(i).Range.InsertParagraphAfter
            Selection.Document.Lists(i).RemoveNumbers
        Next i
    End Sub

    '***********************************************************
    ' Returns the MarkDown indent text
    '***********************************************************
    Function ListIndent(ByVal ipNumber As Integer, ByVal spChar As String) As String
        Dim i As Integer
        For i = 1 To ipNumber - 1
            ListIndent = ListIndent & " "
        Next
        ListIndent = ListIndent & spChar & " "
    End Function

Source: ProgTips

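If you're open to using the .docx format, you can use this PHP script I put together, which extracts the XML, runs some XSL transformations, and outputs a pretty decent Markdown equivalent: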

https://github.com/matb33/docx2md

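Note that it is meant to be run from the command line and has a rather basic interface. However, it will get the job done!

If the script isn't working well enough for you, I encourage you to send me your .docx files so that I can reproduce your issue and fix it. Log an issue on GitHub, or contact me directly if you prefer.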
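
To show the general idea of "extract the XML and transform it", here is a minimal Python sketch. It is not how docx2md is implemented: it walks the XML with xml.etree.ElementTree instead of running XSL transformations, it handles only headings, bold, and italics, and it assumes the default English style ids Heading1 through Heading6. The function names are invented for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, tag):
    """True if a toggle such as <w:b/> is present and not switched off."""
    if run_props is None:
        return False
    el = run_props.find(W + tag)
    return el is not None and el.get(W + "val", "true") not in ("0", "false")


def docx_to_markdown(path):
    """Read word/document.xml from a .docx file and emit simple Markdown."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(W + "p"):
        parts = []
        for run in para.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if not text:
                continue
            props = run.find(W + "rPr")
            if _is_on(props, "i"):
                text = f"_{text}_"
            if _is_on(props, "b"):
                text = f"**{text}**"
            parts.append(text)
        if not parts:
            continue  # skip empty paragraphs
        style = para.find(f"{W}pPr/{W}pStyle")
        name = style.get(W + "val", "") if style is not None else ""
        # Assumption: heading paragraphs use the ids "Heading1".."Heading6".
        if name.startswith("Heading") and name[7:].isdigit():
            parts.insert(0, "#" * int(name[7:]) + " ")
        lines.append("".join(parts))
    return "\n\n".join(lines)
```

Lists, tables, images, and links are deliberately left out of this sketch.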

Pandoc is a good command-line conversion tool, but you will first need to get the input into a format that Pandoc can read, namely:

  • markdown
  • reStructuredText
  • textile
  • HTML
  • LaTeX
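We had the same problem of having to convert Word documents to Markdown. Some were more complicated and (very) large documents, with math equations, images, and so on. So I made this script, which converts using a number of different tools: https://github.com/Versal/word2markdown

Because it uses a chain of several tools it is somewhat more error-prone, but it can be a good starting point if you have more complicated documents. Hope it helps! 🙂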

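Update: it currently only works on Mac OS X, and you need to have some requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). For it to work properly, you also need to open an empty Word document and go to File -> Save As Web Page -> Compatibility -> Encoding -> UTF-8. That encoding is then saved as the default. See the README for more details on how to set it up.

Then run this in the console: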

    $ git clone git@github.com:Versal/word2markdown.git
    $ cd word2markdown
    $ npm install
    (copy over the Word files, for example, "document.docx")
    $ ./doc-to-md.sh document.docx document_files > document.md

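You can then find the Markdown in document.md, with accompanying files such as images in the directory document_files.

It may be a bit complicated right now, so I would welcome any contributions that make it easier to use or get it working on other operating systems! 🙂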

Have you tried this one? I'm not sure how feature-rich it is, but it works well for simple text. http://markitdown.medusis.com/

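As part of a Ruby course at my university, I developed a tool that converts OpenOffice Writer files (.odt) to Markdown. A lot of assumptions have to be made in order to convert to the correct format. For example, it is hard to determine what text size should be treated as a heading. However, the only thing you can ever lose in this conversion is formatting: any text that is encountered is always appended to the Markdown document. The tool supports lists, bold and italic text, and it has syntax for tables.

http://github.com/bostko/doc2text Give it a try, and please send me your feedback.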
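
To illustrate the kind of assumption involved in guessing headings from text size, here is one possible heuristic sketched in Python. It is not taken from doc2text: it simply assumes that the most common font size is body text and that each distinct larger size is a heading, with the largest size becoming level 1. The function name heading_levels is hypothetical.

```python
from collections import Counter


def heading_levels(font_sizes):
    """Map each paragraph's font size to a heading level (0 = body text).

    Assumption: the most common size is body text; every distinct larger
    size is a heading, ranked from largest (level 1) downwards, capped at 6.
    """
    if not font_sizes:
        return []
    body_size = Counter(font_sizes).most_common(1)[0][0]
    larger = sorted({s for s in font_sizes if s > body_size}, reverse=True)
    level_of = {size: min(rank, 6) for rank, size in enumerate(larger, start=1)}
    return [level_of.get(size, 0) for size in font_sizes]
```

A heuristic like this breaks down on documents that use large text for emphasis rather than structure, which is exactly why such conversions rest on assumptions.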