How to Convert Kiwix *.zim Files into Accessible Text Files (txt, html etc.)

nhb42 · Posted on 2020-2-23 14:04:20

Last edited by nhb42 on 2020-2-23 14:21

I want to make an MDX version of the latest Simple English Wikipedia (with pictures) from the *.zim file provided by Kiwix: http://download.kiwix.org/zim/wikipedia_en_simple_all_maxi.zim.
But I couldn't find a way to extract the data from the zim file or convert it into txt, htm or html format.
I also tried a program named Zim (https://zim-wiki.org/downloads.html) to export the zim file to htm, but couldn't do anything with it.
Is it really possible to extract the data from the zim file?
If there's a way, please let me know...
And if it's not possible, can anyone grab the original website data? (I found it very messy to crawl. I tried Cyotek WebCopy to crawl the pages, but a lot of extra pages were downloaded...)

One more thing, I'm using Windows 10.

firetimer · Posted on 2020-2-23 14:04:21

The raw data archives are here. The password for them is "ensimp".

Link: https://pan.baidu.com/s/1-m8ptNbq7Fe0lfUxsU9I4w
Extraction code: 48pf
Redirects and links still need work; there is still a long way to go...

喬治兄 · Posted on 2020-2-23 14:32:14

Last edited by 喬治兄 on 2020-2-23 14:54

Dear Brother nhb42:
Try this
http://www.shouce.ren/post/view/id/5170
https://shazi.info/mediawiki-1-2 ... 9%A8-visual-editor/
https://www.mediawiki.org/wiki/Extension:VisualEditor
https://zim-wiki.org/

Exporting from the commandline

Try something like:

$ zim --export --output=./html --format=html --template=./foo.html ~/Notes

See "zim --help" for all options.

Dear Brother nhb42, you don't have to convert the .zim file to MDX.
GoldenDict supports the .zim file format perfectly, including full-text search.

kriskr · Posted on 2020-2-23 19:43:06

喬治兄 posted on 2020-2-23 14:32
Dear Brother nhb42:
Try this
http://www.shouce.ren/post/view/id/5170

But Eudic doesn't support zim files, so it is best to convert it into MDX format.

nhb42 · Posted on 2020-2-24 10:23:51

Last edited by nhb42 on 2020-2-24 10:27

喬治兄 posted on 2020-2-23 14:32
Dear Brother nhb42:
Try this
http://www.shouce.ren/post/view/id/5170

Actually... it didn't help me. I mentioned that I had used the Zim software, but all in vain. The reason might be that I have no programming knowledge. Besides, I use only MDict and have no intention of using other apps; Kiwix can already read ZIM files. I just want to use all my dictionaries in one app. So using MDX is a Hobson's choice!

And a request: if you can convert it [Simple English Wikipedia] into TXT or HTML format, please feel free to share a cloud link to the converted file.

喬治兄 · Posted on 2020-2-24 16:03:23

Last edited by 喬治兄 on 2020-2-24 16:09

nhb42 posted on 2020-2-24 10:23
actually... it didn't help me, I mentioned that I used Zim software, but all in vain. the reasons m ...

Brother nhb42:
Maybe you have already tried the software's export function. I really don't know much about the software.

https://zim-wiki.org/manual/Help/Export.html

Zim - A Desktop Wiki

Exporting

Zim will be able to export content to various formats. At the moment exporting to HTML and LaTeX is supported, as well as the Markdown and RST text formats.

Export dialog
To open the export dialog in zim use the "File->Export" menu item. This dialog asks for a number of input fields before you can start exporting.

Step 1: Select the pages to export
The option Complete Notebook will export all pages in the current notebook.

The option Single page allows you to select a single page to export.

When the Include subpages option is selected all pages below the selected page will be exported as well recursively.

Step 2: Select the export format
The Format allows the choice of the output format.

The Template field asks you to select a template file (see below). When you select "Other..." in the combo box you can browse for another file in the input field below the combo box.

If your notebook has a Document Root (see Properties) you can select what to do with links to files under that document root. Either Link files under document root with full file path, which means files will be linked by their absolute file path, or Map document root to URL, which will result in links with the given URL as prefix. This can be useful when you publish pages as part of a larger website.

Step 3: Select the output file or folder
Depending on the choice of pages to export and the format to export, you get the choice to either Export each page to a separate file or to Export all pages to a single file. Exporting each page to a separate file typically results in a folder with multiple files, one for each page that is exported, very similar to the zim notebook itself. Exporting to a single file creates a different view where all pages are combined in a single output template.

Here you can select the output folder (if you are exporting multiple pages) or the output file (if you export a single page).

If you specify an Index page a page will be generated that contains a list with links to all pages that were exported. This can e.g. be used as a site map.

Attachments
Files and images that live inside the notebook directory (attachments, equations etc.) will always be copied to the new output directory when you export a notebook.

Templates
The export code only produces the tags that represent the content of the page. Templates are used to create complete output. A few standard templates are packaged with zim, see the pages for the output formats for a list and descriptions. You can also make your own.

Exporting from the commandline
Try something like:

$ zim --export --output=./html \
--format=html --template=./foo.html ~/Notes

See "zim --help" for all options.

nhb42 · Posted on 2020-2-24 19:28:40

喬治兄 posted on 2020-2-24 16:03
Brother nhb42:
Maybe you have tried the software export function, I really don't have any idea abou ...

I tried this before posting here. This is what is written on the website, and I followed it accordingly. The contents/index of the Simple English Wikipedia zim file doesn't show up.

One question: did you actually do this yourself before posting here? I mean, have you exported a zim file from Kiwix before?

喬治兄 · Posted on 2020-2-24 19:38:42

nhb42 posted on 2020-2-24 19:28
I tried this before posting here. this is what written in the website and I followed accordingly. ...

Dear Brother nhb42:
I haven't tried it yet, sorry about that.

firetimer · Posted on 2020-2-29 21:09:52

Last edited by firetimer on 2020-3-1 00:41

After looking into the format of the files produced by Zim, I guess it is just a coincidence that these files share the same suffix... Further investigation is needed and I'm still trying.
UPDATE 1: It seems that https://github.com/tim-st/go-zim may help, but I haven't tested it.
UPDATE 2: The website https://wiki.openzim.org/wiki/OpenZIM seems to be an official site of this file format.
UPDATE 3: I got some results, but they are rather bad: no HTML tags, not alphabetically sorted, no titles...
UPDATE 4: Why not try converting from a more structured source (https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=13368) or similar?

The Florida Keys Hurricane of 1919 was a strong hurricane in September 1919. It killed more than 770 people. It moved through the Florida Keys and southern Texas. It was the first such storm to cause a lot of damage in Corpus Christi, Texas. This storm did $22 million in damage. In Texas alone, the official number of deaths was 286, but it may have been closer to 600. The winds were around Category 3 level between Brownsville and Corpus Christi. They were at Category 4 levels over the Florida Keys.
An open star cluster, also known as galactic cluster, is a group of a few hundred or thousand stars. They have roughly the same age, and were formed from the same giant molecular cloud.
Granhagen was born in Luleå, Sweden. She was engaged to actor Olof Thunberg.
HMS Victory is the oldest ship still in use. It is in Portsmouth, England with the HMS Warrior and the remains of the Mary Rose, a ship belonging to Henry VIII of England.
The worship of Aphroditus comes from Cyprus, a cult of a masculine or hermaphrodite form of Aphrodite. The divinity was introduced into Greece and celebrated in Athens in a cross-dressing ritual.
It is thought that Hesiod's myth, explaining the birth of Aphrodite, born when Cronus cut off the genitals of Uranus and threw it into the sea, originated from the cult of Aphroditus. A terracotta plaque from the 7th century BC found at Perachora in Greece, representing Aphroditus emerging from severed male genitals, suggests this, as there are two different myths of the creation of Aphrodite.
<...>

firetimer · Posted on 2020-3-1 13:06:30

Last edited by firetimer on 2020-3-1 14:10

Finally I got it working by using zimdump on Ubuntu. If you can deal with raw HTML files, I can send them to you once the whole file has been fully processed. I'm waiting for your response. A sample can be found here: https://www.pdawiki.com/forum/thread-38967-1-1.html
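
For readers who want to reproduce this extraction step, here is a minimal Python sketch that shells out to zimdump. The post does not give the exact command that was used, so the flags below are an assumption: recent zim-tools releases document `zimdump dump --dir=DIR FILE`, while older builds used `zimdump -D DIR FILE`. Check `zimdump --help` on your own system first. The function names are hypothetical.

```python
import subprocess


def build_zimdump_command(zim_file, out_dir, legacy=False):
    """Return the argument list for dumping every entry of a ZIM file.

    ASSUMPTION: the flag spellings follow the zim-tools documentation,
    not this thread. Verify them with `zimdump --help` before use.
    """
    if legacy:
        # Older zimdump builds: -D <dir> dumps all entries into <dir>.
        return ["zimdump", "-D", str(out_dir), str(zim_file)]
    # Newer zimdump builds use a "dump" subcommand with --dir=<dir>.
    return ["zimdump", "dump", f"--dir={out_dir}", str(zim_file)]


def dump_zim(zim_file, out_dir, legacy=False):
    """Run zimdump; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_zimdump_command(zim_file, out_dir, legacy), check=True)
```

Calling `dump_zim("wikipedia_en_simple_all_maxi.zim", "simplewiki_dump")` (both paths are placeholders) would start the dump, provided zimdump is installed and on the PATH.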

nhb42 · Posted on 2020-3-1 14:19:10

firetimer posted on 2020-3-1 13:06
Finally I got it by using zimdump on Ubuntu. If you can deal with raw html files I may send them to ...

Will the image files be unpacked too?

firetimer · Posted on 2020-3-1 14:21:09

Last edited by firetimer on 2020-3-1 14:23

nhb42 posted on 2020-3-1 14:19
will the image files be unpacked too?

I didn't check, but there are thousands of images and JavaScript files in the directory. Sounds seem to be missing.
The extraction is still running because of the huge data size...

firetimer · Posted on 2020-3-1 14:28:22

Last edited by firetimer on 2020-3-1 14:32

Known issues:
1. Sounds seem to be missing. (Because they are not packed into .zim files?)
2. There is no ".htm" suffix, so the files have to be processed manually (I can add the suffix, but fixing the links inside the HTML files is hard for me).
3. Items whose names contain special symbols or phrases that Windows or Ubuntu does not accept as part of a file name seem to be missing, such as: !!!, $NT (unsolvable unless you and I use an operating system other than these two...)
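
For issue 2, one possible approach to the link problem is a regex pass over each dumped page. The sketch below is only an illustration under stated assumptions: it assumes every internal `<a href="...">` points at an article file that has been renamed with the suffix, that attribute values are double-quoted, and that external links start with one of a few known schemes. The function name is hypothetical and none of this has been checked against the actual dump.

```python
import re

# Captures the href value of an <a ...> tag (double-quoted values only).
A_HREF = re.compile(r'(<a\s(?:[^>]*?\s)?href=")([^"]*)(")', re.IGNORECASE)

# ASSUMPTION: links starting with one of these are not local article links.
SKIP_PREFIXES = ("#", "//", "http:", "https:", "mailto:")


def add_suffix_to_links(html, suffix=".htm"):
    """Append `suffix` to the target of every internal <a href> in `html`."""

    def fix(match):
        url = match.group(2)
        # Leave empty, external and same-page anchor links untouched.
        if not url or url.lower().startswith(SKIP_PREFIXES):
            return match.group(0)
        # Keep any "#fragment" after the file name.
        path, sep, fragment = url.partition("#")
        if path.lower().endswith((".htm", ".html")):
            return match.group(0)
        return f"{match.group(1)}{path}{suffix}{sep}{fragment}{match.group(3)}"

    return A_HREF.sub(fix, html)
```

For example, `add_suffix_to_links('<a href="Weather#Types">w</a>')` yields `<a href="Weather.htm#Types">w</a>`, while links to `https://...` and `#top` stay as they are. A real HTML parser would be more robust than a regex if the markup turns out to be irregular.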

nhb42 · Posted on 2020-3-1 14:31:54

firetimer posted on 2020-3-1 14:28
Known issues:
1. Sounds seem to be missing.
2. No ".htm" suffix so should be processed manually (I c ...

OK! I can deal with that. It's better to have something than nothing...
Reminder: I just need this file, https://download.kiwix.org/zim/wikipedia_en_simple_all_maxi.zim, mentioned in the thread.
I'm waiting for the unpacked files.

firetimer · Posted on 2020-3-1 14:53:25

Last edited by firetimer on 2020-3-1 14:54

As there will be an ENORMOUS number of files, over 3 MILLION (and they take hours or even tens of hours to extract completely), I've decided to use compression software. Do you accept the .zip, .rar or .7z format? (You can choose one or more of them and I will test whether it supports such an ENORMOUS number of files...)
(I'm worried about whether my (or your) hard disk can hold them...)

firetimer · Posted on 2020-3-1 16:28:23

Last edited by firetimer on 2020-3-1 16:30

Some tips. As the folder structure is no longer preserved within each section, all links have to be modified to match; otherwise you will get pages that report many "File not found" errors.
For example, in the original file an image is stored as "I/m/Weather.png", in a folder named "m" nested in the big folder "I" (Images). But now it becomes "I/m%2fWeather.png" (directly in folder "I"), so the image is no longer displayed correctly without a proper find & replace. My suggestion is to apply a regex that finds all links starting with "../I" (and "../-" or similar) and converts every slash after that prefix into "%2f" (the URL percent-encoding of a slash).
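
The find & replace suggested above could be sketched in Python roughly as follows. This is a hedged illustration, not the script that was actually used: it assumes the links appear in double-quoted `href` or `src` attributes and start with exactly "../I/" or "../-/". The function name and the `replacement` parameter are hypothetical.

```python
import re

# An href/src value that starts with "../I/" or "../-/" (double-quoted only).
FLATTENED_LINK = re.compile(r'((?:href|src)=")(\.\./(?:I|-)/)([^"]*)(")')


def flatten_links(html, replacement="%2f"):
    """Replace "/" with `replacement` in the part after "../I/" or "../-/"."""

    def fix(match):
        attr, prefix, rest, quote = match.groups()
        # Only the part after the namespace folder is rewritten.
        return attr + prefix + rest.replace("/", replacement) + quote

    return FLATTENED_LINK.sub(fix, html)
```

With this, `flatten_links('<img src="../I/m/Weather.png">')` gives `<img src="../I/m%2fWeather.png">`. Some viewers percent-decode a link before looking up the file, which would turn "%2f" back into a slash; that behavior is not covered in the thread and would need testing, which is why the replacement string is left as a parameter.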

Progress: All HTML seems to have been dumped successfully. Zimdump is now processing the images.
That seems right, doesn't it?

nhb42 · Posted on 2020-3-1 22:43:22

Last edited by nhb42 on 2020-3-1 22:44

firetimer posted on 2020-3-1 16:28
Some tips. As the folder structure is no longer remained in each section, all links should be modifi ...

I totally got it! I will make an MDX version of this encyclopedia... Currently, I'm on a limited data connection. I've transferred the files you provided to my Baidu cloud. In 4-5 days I will have a broadband connection, and then I'll process the data to make the MDX.

And of course, I'll give you the MDX version as soon as I make it.

Thank you for your hard work...

firetimer · Posted on 2020-3-2 00:10:39

nhb42 posted on 2020-3-1 22:43
I totally got it! I will make an MDX version of this encyclopedia... Currently, I'm using a limited ...

What's more, I'm trying to solve some of the problems mentioned before and may introduce a BRAND NEW extractor based on another approach in a post later this week. It is 2x faster and compatible with Windows 10. If you are interested, I will let you know by private message.

George... · Posted on 2020-3-3 22:13:49

Thanks to the original poster for sharing.
