
View .docx file on Github and use git diff on .docx file format

Tags: git, github

I have two questions:

  1. Is there any way to view a .docx file on GitHub? We have uploaded all of our assignments to GitHub, but there is no way to view them within the browser. It would be nice if we could view those .docx files in the browser without downloading them.

  2. How can I use git diff on the .docx file format? I tried catdoc, but it didn't work for me. I think I have used git diff on Windows for the .doc format before, but it's not working for me on Mac.

Thanks a lot.

asked Mar 16 '14 at 16:03 by r3b00t



3 Answers

This answers the second part of the question. It is an old post, but it still shows up in the top 10 search results without an answer. With the following settings you get a poor man's diff on .docx files.

In .gitattributes use:

*.docx diff=zip

In .git/config use:

[diff "zip"]
      textconv = unzip -c -a

As a bonus my settings for old word/excel and new word/excel:

In .gitattributes use:

*.doc diff=word
*.xls diff=excel
*.xlsx diff=zip
*.docx diff=zip

In .git/config use:

[diff "word"] 
    textconv = strings
[diff "excel"]
    textconv = strings
[diff "zip"]
    textconv = unzip -c -a
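The unzip -c -a textconv above prints every member of the archive, so the diff includes raw XML. A narrower alternative, sketched here in Python with only the standard library, reads just word/document.xml and prints one paragraph per line. This is not part of the original answer: the function names (docx_to_text, make_sample_docx) are made up for illustration, and headers, footnotes, and formatting are ignored.

```python
import io
import sys
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the document body as plain text, one paragraph per line.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W + "p"):
        # Concatenate every text run (<w:t>) inside the paragraph.
        lines.append("".join(t.text or "" for t in paragraph.iter(W + "t")))
    return "\n".join(lines)


def make_sample_docx(paragraphs):
    """Build a minimal in-memory .docx-like archive (demonstration only)."""
    body = "".join(
        "<w:p><w:r><w:t>%s</w:t></w:r></w:p>" % escape(p) for p in paragraphs
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>%s</w:body></w:document>' % body
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # When used as a textconv, git passes the file to convert as an argument.
        print(docx_to_text(sys.argv[1]))
    else:
        print(docx_to_text(make_sample_docx(["Hello", "World"])))
```

To try it as a textconv, save it under a name of your choosing (say docx2txt.py, a hypothetical name) and point a diff driver at it with textconv = python3 /path/to/docx2txt.py. Git appends the path of a temporary copy of the file to that command.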

answered Oct 21 '22 at 01:10 by Axe


Answering your second question -

Usually when you try

git diff filename.docx

you will get output of the form -

Binary files a/filename.docx and b/filename.docx differ

Not very helpful. A good way around that is to use Pandoc to convert the file to text before diffing.

  • Install Pandoc on your system.
  • Create or edit the file ~/.gitconfig (Linux, Mac) or "c:\Documents and Settings\user\.gitconfig" (Windows), or run git config --global --edit, and add:

    [diff "pandoc"]
         textconv=pandoc --to=markdown
         prompt = false
    [alias]
         wdiff = diff --word-diff=color --unified=1
    
  • In your git controlled directory with .docx files, create or edit file .gitattributes (linux, Windows and Mac) to add

    *.docx diff=pandoc
    
  • You can commit .gitattributes so that it travels with the repository to other computers, but you'll still need to edit ~/.gitconfig on every new computer you want to use.

  • Now you can see a pretty coloured diff with the changes you have made to your .docx file since the last commit

     git wdiff file.docx
    

More details can be found here.
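To make the word-diff step less of a black box: once the textconv has turned both versions into text, a word diff splits each version into words and marks the words that were removed or added. The following is only a rough Python sketch of that idea using difflib, not how git implements it. The function name word_diff is made up, and it uses [-...-] and {+...+} markers in the style of git's --word-diff=plain mode instead of colours.

```python
import difflib


def word_diff(old, new):
    """Mark removed words as [-...-] and added words as {+...+}."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


print(word_diff("The quick brown fox", "The slow brown fox jumps"))
# prints: The [-quick-] {+slow+} brown fox {+jumps+}
```

Unlike this sketch, git keeps the original whitespace and line structure and prints hunk headers, so treat it as an illustration of the markers only.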

answered Oct 21 '22 at 01:10 by Crygnus


The accepted solution (using strings/unzip) didn't work very well for me on Linux Mint 19.3. The following seems to work pretty well for most doc/docx/rtf/xls files as well as their LibreOffice counterparts. Some of these might work on Windows via Cygwin or Git Bash, but I have not tested that. If the packages I mention are not available there, I would look for Python or Perl scripts that do the same conversion and substitute those instead.

  1. Install prerequisites: sudo apt install git pandoc catdoc odt2txt.
  2. Note that catdoc and odt2txt include multiple tools for handling doc/xls/ppt/odt/ods/odp formats not just the ones in the package name. Likewise, pandoc handles all of the newer zipped 'x' formats.
  3. I wanted my attributes to apply globally (i.e. user-scoped) rather than per-project as in the other answers. To create a user-scoped git attributes file, use mkdir -p ~/.config/git/ && touch ~/.config/git/attributes (on Windows this should be mkdir "%USERPROFILE%\.config\git" && echo "" > "%USERPROFILE%\.config\git\attributes")
  4. Set up the git attributes file (either the user-scoped file mentioned in the previous step or the project-scoped file ${projectDir}/.git/info/attributes as desired):
    # handle windows *.reg files (utf-16 which git doesn't normally like)
    *.reg diff=utf16

    # handle misc common document formats
    *.pdf diff=pdf2txt
    *.rtf diff=catdoc

    # handle libre/open document formats
    *.ods diff=ods2txt
    *.odp diff=odp2txt
    *.odt diff=odt2txt

    # handle older common ms document formats
    # note: ppt did not work for me
    *.doc diff=catdoc
    *.ppt diff=catppt
    *.xls diff=xls2csv

    # handle newer zipped ms document formats
    # note: pptx and xlsx did not work for me
    *.docx diff=pandoc
    *.pptx diff=pandoc
    *.xlsx diff=pandoc
  5. Create .gitconfig definitions (either in the user-scoped ~/.gitconfig or in the project-scoped ${projectDir}/.git/config). Much of this is based on this article but altered based on my own testing.
    [core]
        autocrlf = false
    [diff]
        guitool = kdiff3
    [diff "odp2txt"]
        textconv = odp2txt
        binary = true
    [diff "odt2txt"]
        textconv = odt2txt
        binary = true
    [diff "ods2txt"]
        textconv = ods2txt
        binary = true
    [diff "catdoc"]
        textconv = catdoc
        binary = true
    # note catppt did not work for me
    [diff "catppt"]
        textconv = catppt
        binary = true
    [diff "xls2csv"]
        textconv = xls2csv
        binary = true
    [diff "xlsx2csv"]
        textconv = xlsx2csv
        binary = true
    [diff "pandoc"]
        textconv=pandoc --to=markdown
        prompt = false
    [diff "pdf2txt"]
        textconv=pdf2txt
        binary = true
    [diff "utf16"]
        textconv = iconv -c -f UTF-16LE -t ASCII
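For anyone wondering what the utf16 driver actually does to a .reg file: iconv decodes the bytes as UTF-16LE and, because of -c, silently drops every character that has no ASCII equivalent (including the byte-order mark). A rough Python equivalent, as a sketch only (the function name utf16le_to_ascii and the sample string are made up for illustration):

```python
def utf16le_to_ascii(raw):
    """Decode UTF-16LE bytes and drop every character that is not ASCII."""
    text = raw.decode("utf-16-le", errors="ignore")
    return text.encode("ascii", errors="ignore").decode("ascii")


# A .reg export typically starts with a byte-order mark followed by a header line.
sample = "\ufeffWindows Registry Editor Version 5.00\r\n".encode("utf-16-le")
print(repr(utf16le_to_ascii(sample)))
# prints: 'Windows Registry Editor Version 5.00\r\n'
```

Note that this conversion is lossy: any non-ASCII value in the file disappears from the diff, so two versions that differ only in such characters will look identical.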

I was never able to get diffs working for xlsx, ppt, or pptx, even after downloading the latest version of pandoc from its GitHub page. The docx conversion worked fine even with the very old version that is in the Mint/Ubuntu/Debian repos (v1.19.2.4 from 2016). For the xlsx/pptx samples I was using, I always got either "Invalid UTF-8 stream fatal" (old version) or "UTF-8 decoding error" (new version).

This could have been due to the sample files I was using (some samples from the web and some samples I created by converting LibreOffice documents), my system setup, the versions I was using or something else.

For completeness, after installing the newer pandoc, I was using:

$ uname -vipor
5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 GNU/Linux

$ dpkg -l catdoc odt2txt pandoc git xlsx2csv|grep '^ii'
ii  catdoc         1:0.95-4.1          amd64        text extractor for MS-Office files
ii  git            1:2.17.1-1ubuntu0.5 amd64        fast, scalable, distributed revision control system
ii  odt2txt        0.5-1build2         amd64        simple converter from OpenDocument Text to plain text
ii  pandoc         2.9.2-1             amd64        general markup converter
ii  xlsx2csv       0.20+20161027+git5785081-1 all          convert xslx files to csv format

EDIT: I also tried the xlsx2csv package for xlsx conversion instead of pandoc and had issues with that as well. It could have something to do with my samples, but since I am not doing anything special to create them, I would consider that a coverage gap or limitation of xlsx2csv/pandoc if so.
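One possible explanation for the pandoc errors, which I have not verified against this exact setup: as far as I know, pandoc of that era can read docx and odt but has no reader for xlsx or pptx (pandoc --list-input-formats prints what your version supports), so it may be trying to parse the zip bytes as text. As a crude fallback for xlsx, an xlsx file is a zip archive just like docx, and the cell text usually lives in xl/sharedStrings.xml. The Python sketch below dumps only those strings. Numbers, formulas, and cell positions are ignored, and the function names (xlsx_shared_strings, make_sample_xlsx) are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# SpreadsheetML namespace used by xl/sharedStrings.xml inside an .xlsx archive.
S = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_shared_strings(source):
    """Return the workbook's shared strings, one per line (cell text only)."""
    with zipfile.ZipFile(source) as archive:
        if "xl/sharedStrings.xml" not in archive.namelist():
            return ""
        root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    return "\n".join(
        # A string item may hold one <t> or several rich-text runs with <t> each.
        "".join(t.text or "" for t in item.iter(S + "t"))
        for item in root.iter(S + "si")
    )


def make_sample_xlsx(strings):
    """Build a minimal in-memory archive with only sharedStrings.xml (demo only)."""
    items = "".join("<si><t>%s</t></si>" % escape(s) for s in strings)
    xml = (
        '<sst xmlns="http://schemas.openxmlformats.org/'
        'spreadsheetml/2006/main">%s</sst>' % items
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/sharedStrings.xml", xml)
    buffer.seek(0)
    return buffer


print(xlsx_shared_strings(make_sample_xlsx(["Name", "Total"])))
```

This only tells you which text appeared or disappeared between versions, not which cell changed, so it is a stopgap and not a replacement for a proper xlsx converter.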

answered Oct 21 '22 at 00:10 by zpangwin