
[Bounty] How to Convert Kiwix *.zim Files into Accessible Text Files (txt, html, etc.)

    #1 | Posted 2020-2-23 14:04:20
    Last edited by nhb42 on 2020-2-23 14:21

    I want to make an MDX version of the latest Simple English Wikipedia (with pictures) from the *.zim file provided by Kiwix: http://download.kiwix.org/zim/wikipedia_en_simple_all_maxi.zim
    But I couldn't find a way to extract the data from the zim file or convert it into txt, htm or html format.
    I also tried a program named Zim (https://zim-wiki.org/downloads.html) to export the zim file to htm, but couldn't get anywhere with it.
    Is it actually possible to extract the data from the zim file?
    If there is a way, please let me know...
    And if it's not possible, can anyone grab the original website data? (I found it very messy to crawl: I tried Cyotek WebCopy, but a lot of extra pages were downloaded...)


    One more thing, I'm using Windows 10.


    #2 | Posted 2020-2-23 14:04:21 | Best answer
    The raw data archives are here. The password for them is "ensimp".

    Link: https://pan.baidu.com/s/1-m8ptNbq7Fe0lfUxsU9I4w
    Extraction code: 48pf
    Redirects and links still have a long way to go...

    #3 | Posted 2020-2-23 14:32:14
    Last edited by 喬治兄 on 2020-2-23 14:54

    Dear Brother nhb42:
    Try this
    http://www.shouce.ren/post/view/id/5170
    https://shazi.info/mediawiki-1-2 ... 9%A8-visual-editor/
    https://www.mediawiki.org/wiki/Extension:VisualEditor
    https://zim-wiki.org/


    Exporting from the commandline
    Try something like:
    $ zim --export --output=./html --format=html --template=./foo.html ~/Notes
    See "zim --help" for all options.

    Dear Brother nhb42, you don't have to convert the .zim file to mdx.
    GoldenDict supports the .zim file format perfectly, including full-text search.


    #4 | Posted 2020-2-23 19:43:06
    喬治兄 posted on 2020-2-23 14:32
    Dear Brother nhb42:
    Try this
    http://www.shouce.ren/post/view/id/5170

    But Eudic doesn't support zim files, so it is best to convert it into mdx format.

    #5 | OP | Posted 2020-2-24 10:23:51
    Last edited by nhb42 on 2020-2-24 10:27

    Actually... it didn't help me. I mentioned that I had already tried the Zim software, but all in vain. The reason might be that I have no programming knowledge. Besides, I use only MDict and have no intention of using other apps, and Kiwix can already read ZIM files. I just want to use all my dictionaries in one app, so using MDX is a Hobson's choice!

    One request: if you can convert it [Simple English Wikipedia] into TXT or HTML format, please feel free to share a cloud link to the converted files.

    #6 | Posted 2020-2-24 16:03:23
    Last edited by 喬治兄 on 2020-2-24 16:09
    nhb42 posted on 2020-2-24 10:23
    actually... it didn't help me, I mentioned that I used Zim software, but all in vain. the reasons m ...

    Brother nhb42:
    Maybe you have already tried the software's export function. I really don't know much about the software.

    https://zim-wiki.org/manual/Help/Export.html

    Zim - A Desktop Wiki

    Exporting

    Zim will be able to export content to various formats. At the moment exporting to HTML and LaTeX is supported, as well as the Markdown and RST text formats.

    Export dialog
    To open the export dialog in zim use the "File->Export" menu item. This dialog asks for a number of input fields before you can start exporting.

    Step 1: Select the pages to export
    The option Complete Notebook will export all pages in the current notebook.

    The option Single page allows you to select a single page to export.

    When the Include subpages option is selected all pages below the selected page will be exported as well recursively.

    Step 2: Select the export format
    The Format allows the choice of the output format.

    The Template field asks you to select a template file (see below). When you select "Other..." in the combo box you can browse for another file in the input field below the combo box.

    If your notebook has a Document Root (see Properties) you can select what to do with links to files under that document root. Either Link files under document root with full file path, which means files will be linked by their absolute file path, or Map document root to URL, which will result in links with the given URL as prefix. This can be useful when you publish pages as part of a larger website.

    Step 3: Select the output file or folder
    Depending on the choice of pages to export and the format to export, you get the choice to either Export each page to a separate file or to Export all pages to a single file. Exporting each page to a separate file typically results in a folder with multiple files, one for each page that is exported, very similar to the zim notebook itself. Exporting to a single file creates a different view where all pages are combined in a single output template.

    Here you can select the output folder (if you are exporting multiple pages) or the output file (if you export a single page).

    If you specify an Index page a page will be generated that contains a list with links to all pages that were exported. This can e.g. be used as a site map.


    Attachments
    Files and images that live inside the notebook directory (attachments, equations etc.) will always be copied to the new output directory when you export a notebook.

    Templates
    The export code only produces the tags that represent the content of the page. Templates are used to create complete output. A few standard templates are packaged with zim, see the pages for the output formats for a list and descriptions. You can also make your own.

    Exporting from the commandline
    Try something like:

    $ zim --export --output=./html \
      --format=html --template=./foo.html ~/Notes
    See "zim --help" for all options.



    #7 | OP | Posted 2020-2-24 19:28:40
    喬治兄 posted on 2020-2-24 16:03
    Brother nhb42:
    Maybe you have tried the software export function, I really don't have any idea abou ...

    I tried this before posting here. This is what is written on the website, and I followed it accordingly. The contents/index of the Simple English Wikipedia zim file doesn't show up.

    One question: did you actually do this yourself before posting here? I mean, have you exported a zim file from Kiwix before?

    #8 | Posted 2020-2-24 19:38:42
    nhb42 posted on 2020-2-24 19:28
    I tried this before posting here. this is what written in the website and I followed accordingly. ...

    Dear Brother nhb42:
    I haven't tried it yet, sorry about that.

    Comments

    LOL (posted 2020-2-24 20:36)

    #9 | Posted 2020-2-29 21:09:52
    Last edited by firetimer on 2020-3-1 00:41

    After looking into the format of the files produced by Zim, I guess it is just a coincidence that they have the same suffix as the Kiwix files... Further investigation is needed and I'm still trying.
    UPDATE 1: It seems that https://github.com/tim-st/go-zim may help, but I haven't tested it.
    UPDATE 2: This website, https://wiki.openzim.org/wiki/OpenZIM, seems to be the official site of this file format.
    UPDATE 3: I got some output, but the results are rather bad: no HTML tags, not alphabetically sorted, no titles... (sample below)
    UPDATE 4: Why not try to convert from more structured sources (https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=13368) or so?
    The Florida Keys Hurricane of 1919 was a strong hurricane in September 1919. It killed more than 770 people. It moved through the Florida Keys and southern Texas. It was the first such storm to cause a lot of damage in Corpus Christi, Texas. This storm did $22 million in damage. In Texas alone, the official number of deaths was 286, but it may have been closer to 600. The winds were around Category 3 level between Brownsville and Corpus Christi. They were at Category 4 levels over the Florida Keys.
    An open star cluster, also known as galactic cluster, is a group of a few hundred or thousand stars. They have roughly the same age, and were formed from the same giant molecular cloud.
    Granhagen was born in Luleå, Sweden. She was engaged to actor Olof Thunberg.
    HMS Victory is the oldest ship still in use. It is in Portsmouth, England with the HMS Warrior and the remains of the Mary Rose, a ship belonging to Henry VIII of England.
    The worship of Aphroditus comes from Cyprus, a cult of a masculine or hermaphrodite form of Aphrodite. The divinity was introduced into Greece and celebrated in Athens in a cross-dressing ritual.
    It is thought that Hesiod's myth, explaining the birth of Aphrodite, born when Cronus cut off the genitals of Uranus and threw it into the sea, originated from the cult of Aphroditus. A terracotta plaque from the 7th century BC found at Perachora in Greece, representing Aphroditus emerging from severed male genitals, suggests this, as there are two different myths of the creation of Aphrodite.
    <...>

    #10 | Posted 2020-3-1 13:06:30
    Last edited by firetimer on 2020-3-1 14:10

    I finally got it working by using zimdump on Ubuntu. If you can deal with raw HTML files, I can send them to you once the whole file has been fully processed. Waiting for your response. A sample can be found here: https://www.pdawiki.com/forum/thread-38967-1-1.html
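
For anyone who wants to reproduce the zimdump step: the post does not show the exact command, so this is only a sketch. It assumes the `zimdump dump --dir=DIR FILE` syntax of recent zim-tools releases; the 2020 build used in the post may have taken different options, so check `zimdump --help` first. The function names and the output folder name are made up for illustration.

```python
import shutil
import subprocess
from typing import List


def build_zimdump_command(zim_path: str, out_dir: str) -> List[str]:
    """Build the argument list for dumping every entry of a ZIM file.

    Assumption: the installed zimdump accepts "dump --dir=DIR FILE".
    Older builds may use a different syntax.
    """
    return ["zimdump", "dump", "--dir={}".format(out_dir), zim_path]


def dump_zim(zim_path: str, out_dir: str) -> None:
    """Run zimdump and fail loudly if it is missing or returns an error."""
    if shutil.which("zimdump") is None:
        raise RuntimeError("zimdump not found; install zim-tools first")
    subprocess.run(build_zimdump_command(zim_path, out_dir), check=True)


# Example call (not executed here, the dump can take hours):
# dump_zim("wikipedia_en_simple_all_maxi.zim", "simple_wiki_dump")
```

Keeping the command construction separate from the call makes it easy to print and inspect the command before starting a run that may last for hours.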

    Comments

    No problem! I can deal with the HTML files... (posted 2020-3-1 14:24)
    OK! Send me the whole file after full processing... (posted 2020-3-1 14:18)

    #11 | OP | Posted 2020-3-1 14:19:10
    firetimer posted on 2020-3-1 13:06
    Finally I got it by using zimdump on Ubuntu. If you can deal with raw html files I may send them to  ...

    Will the image files be unpacked too?

    #12 | Posted 2020-3-1 14:21:09
    Last edited by firetimer on 2020-3-1 14:23
    nhb42 posted on 2020-3-1 14:19
    will the image files be unpacked too?

    I didn't check, but there are thousands of images and JavaScript files in the directory. Sounds seem to be missing.
    The extraction is still going on because of the huge data size...

    Comments

    OK! It doesn't matter much. I need the text contents and as many images as possible... (posted 2020-3-1 14:22)

    #13 | Posted 2020-3-1 14:28:22
    Last edited by firetimer on 2020-3-1 14:32

    Known issues:
    1. Sounds seem to be missing (perhaps because they are not packed into .zim files?).
    2. The files have no ".htm" suffix, so they need further processing. (I can add the suffix to the files, but fixing the links inside the HTML files is hard for me.)
    3. Items whose titles contain symbols or phrases that Windows or Ubuntu do not accept in a file name seem to be missing, such as: !!!, $NT. (Unsolvable unless you and I use an operating system other than these two...)
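
One possible way around issues 2 and 3 is to map each article title to a file name that both systems accept before writing it to disk. The sketch below is an illustration, not what zimdump does: `safe_filename` is a hypothetical helper, and the character set is the list of characters Windows reserves in file names (the slash is also invalid on Linux). Whether this is really why the items above went missing is not confirmed in the thread.

```python
import re

# Characters Windows does not allow in file names, plus control characters.
# "/" is also invalid in a file name on Linux.
UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title: str, suffix: str = ".html") -> str:
    """Percent-encode reserved characters and append a suffix if missing."""
    name = UNSAFE_CHARS.sub(lambda m: "%{:02X}".format(ord(m.group())), title)
    if not name.lower().endswith((".htm", ".html")):
        name += suffix
    return name


print(safe_filename("AC/DC"))   # reserved slash is encoded
print(safe_filename("!!!"))     # no reserved characters, only the suffix is added
```

Links inside the HTML files would need the same mapping applied to their targets, otherwise they would still point at the old names.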

    #14 | OP | Posted 2020-3-1 14:31:54
    firetimer posted on 2020-3-1 14:28
    Known issues:
    1. Sounds seem to be missing.
    2. No ".htm" suffix so should be processed manually (I c ...

    OK! I can deal with that. It's better to have something than nothing...
    Reminder: I just need this file, https://download.kiwix.org/zim/wikipedia_en_simple_all_maxi.zim, mentioned in the thread.
    I'm waiting for the unpacked files.

    Comments

    And for sure, I'm working on exactly the file you provided, with 383380 items and 3 million titles. The file was automatically renamed to wikipedia_en_simple_all_maxi_2020-02.zim when I clicked the link. (posted 2020-3-1 14:44)
    Would you please give me a sample after processing? I'm interested in what these thousands of tons of files could be ^_^ (posted 2020-3-1 14:35)

    #15 | Posted 2020-3-1 14:53:25
    Last edited by firetimer on 2020-3-1 14:54

    As there will be an ENORMOUS number of files, over 3 MILLION (taking hours or even tens of hours to extract completely), I have decided to pack them with compression software. Do you accept .zip, .rar or .7z format? (You can choose one or more of them, and I will try to find out whether it supports such an ENORMOUS number of files...)
    (I'm worried about whether my (or your) hard disk can hold them...)

    #16 | Posted 2020-3-1 16:28:23
    Last edited by firetimer on 2020-3-1 16:30

    Some tips. As the folder structure is no longer preserved within each section, all links need to be modified to match; otherwise you will get pages that report many "File not found" errors.
    For example, in the original file an image is stored as "I/m/Weather.png", in a folder named "m" nested in the top-level folder "I" (Images). But now it becomes "I/m%2fWeather.png" (directly in folder "I"), so the image is no longer shown correctly without a proper find & replace. My suggestion is to apply a regex that finds every link starting with "../I" (and "../-" or so) and converts the slashes after that prefix into "%2f" (the URL percent-encoding of a slash).
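
The find & replace suggested above could look roughly like this. It is a minimal sketch, assuming the links live in `src` or `href` attributes and start with one or more `../` followed by the `I` or `-` folder; the pattern and the function name `flatten_links` are illustrative and have not been run against the real dump.

```python
import re

# Matches src/href values such as ../I/m/Weather.png or ../-/s/style.css.
# Group 2 is the prefix that is kept; group 3 is the part to flatten.
LINK_RE = re.compile(
    r"""((?:src|href)=["'])((?:\.\./)+(?:I|-)/)([^"']+)(["'])"""
)


def flatten_links(html: str) -> str:
    """Replace every slash after the ../I/ or ../-/ prefix with %2f."""
    def repl(match):
        flattened = match.group(3).replace("/", "%2f")
        return match.group(1) + match.group(2) + flattened + match.group(4)
    return LINK_RE.sub(repl, html)


print(flatten_links('<img src="../I/m/Weather.png">'))
```

Whether a given viewer then finds a file literally named `m%2fWeather.png` depends on how it decodes percent escapes, so it is worth checking a few pages before converting the whole dump.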

    Progress: All the HTML seems to have been dumped successfully. Zimdump is now dealing with the images.
    That seems right, doesn't it?

    #17 | OP | Posted 2020-3-1 22:43:22
    Last edited by nhb42 on 2020-3-1 22:44
    firetimer posted on 2020-3-1 16:28
    Some tips. As the folder structure is no longer remained in each section, all links should be modifi ...

    I totally got it! I will make an MDX version of this encyclopedia... Currently I'm on a limited data connection. I've transferred the files you provided to my Baidu cloud. In 4-5 days I will have a broadband connection, and then I'll process the data to make the MDX.

    And of course, I'll give you the MDX version as soon as I make it.


    Thank you for your hard work...


    #18 | Posted 2020-3-2 00:10:39
    nhb42 posted on 2020-3-1 22:43
    I totally got it! I will make an MDX version of this encyclopedia... Currently, I'm using a limited ...

    Also, I'm trying to solve some of the problems mentioned before, and I may introduce a BRAND NEW extractor based on another approach in a post later this week: 2x faster and compatible with Windows 10. If you are interested, I will let you know by private message.

    Comments

    I'm interested... (posted 2020-3-2 11:13)

    #19 | Posted 2020-3-3 22:13:49
    Thanks to the OP for sharing.