Migrating existing Smoothie content to Makerforums

Hey there, nice people of the maker forums.

So, I’m thinking about what to do about the end of G+.
You guys have offered to open a Smoothie category, which is nice, thanks! I’ve added a link to it in Smoothie’s top menu.

Now here are the things I’m wondering about:

  1. Could you guys take the history of G+ posts from the old Smoothie G+ community (it’s full of really, really good data) and re-insert it here seamlessly as a history of forum posts?
    If that’s possible, I think I’d seriously consider using this forum as the official G+ archive for Smoothie, instead of self-hosting it.

  2. Smoothie has a forum, hosted at http://forum.smoothieware.org/forum/c-496918/general . It’s a Wikidot forum, because Smoothie used to be a Wikidot wiki years ago (I copied the guys at Contraptor, whom I was working with at the time), but it turned out Wikidot sucked, so I moved to self-hosting (DokuWiki, on my own server) a few years back. But I couldn’t move the forum, so that stayed on Wikidot.
    Is it imaginable to do an export/grab of that forum (I could potentially write the script for that myself), and then re-insert that forum’s history here seamlessly, as a history of posts that’d look like they were posted here originally?
    If that’s imaginable, I think I’d seriously consider using this forum as both the G+ archive and as the new forum for Smoothie’s community. Comments welcome from the Smoothie community, this community, and anyone who’d feel like helping with the migration effort.

Thanks a ton in advance for anything you do or think about :slight_smile:

@mcdanlj Do you have a takeout of the Smoothieware G+ community?

@Arthur_Wolf and @funinthefalls A total of 722 G+ Smoothie posts were imported as Discourse topics, with 4840 comments imported as Discourse posts. Spam filtering removed one post by someone identified as having posted spam or spam-like content in a previous import of at least one other community, along with 110 comments on posts so identified.

Imaginable. I don’t know how much work it would be. Can you please generate a ZIP archive snapshot/backup of your wiki for me to look through?

Hey!

I’ve just seen this category filling up with the G+ data, thanks a ton that is a very much appreciated import, and I’m so glad this data will be available here.

Now, about the forum: Wikidot doesn’t offer a good way to back up the forum data, so I’m going to need to code a crawler that grabs the whole thing and saves it in a usable format.
Can you give me a format that would make it easy for you to import the old forum’s posts? I would guess that if I do my export in the same format the G+ posts were saved in, you’d be able to just run your G+ script on this forum data and have everything work without any extra effort, right?
If that’s something you think would work, then all I need from you is an example of the data format your script wants to eat. Just tell me and I’ll get right on it.

Thanks for all the support! If we get this all to work, this forum becomes the official Smoothie forum :slight_smile:

You’re very welcome!

Have you tried the ZIP archive that Wikidot says they offer, and that I linked to the help for? I’d love to look through that and see how close to usable it is.

My G+ import script depends on the Google user information that’s scraped from G+, so it wouldn’t work for importing from Wikidot.

If we need to crawl the site, I’ll need two kinds of data: users and posts. I’ll want it all in JSON, because why anything else? :slight_smile:

User:

  • email (alternatively, we have to make up @wikidot.invalid addresses, which Discourse knows not to send email to)
  • username
  • unique wikidot ID (if there is such a thing separate from the username)
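In case it helps when wiring things up, here’s a rough Python sketch of how such a user record might be synthesized; the field names and the `make_user` helper are my own illustration, not anything the importer requires, and the only load-bearing assumption is that `.invalid` addresses never generate outgoing mail.

```python
# Hypothetical helper: build a user record from scraped Wikidot data,
# synthesizing a @wikidot.invalid address when no real email is available
# (.invalid is a reserved TLD, so such addresses can never receive mail).

def make_user(username, email=None, wikidot_id=None):
    return {
        "username": username,
        "email": email or f"{username}@wikidot.invalid",
        # Fall back to the username if Wikidot has no separate unique ID.
        "id": wikidot_id or username,
    }

print(make_user("papergeek"))
```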

Picking a user at random, the information at http://www.wikidot.com/user:info/papergeek is sparse. As a logged-in user, do you have more information?

Then for posts, I need posts and their associated comments.

For a post, I need:

  • A unique ID for the post itself (could be the URL)
  • The title of the post
  • The date (ISO timestamp) on which it was created
  • The id of the author (whatever we decide above)
  • The text of the post, which I need to get into markdown (which includes simple HTML) or bbcode
  • List of any images to attach to the end — though I think if we just use links to images in wikidot, the system will auto-download the images, so I don’t think we need to copy them over the way I did with G+
  • If you have sub-categories within the forum, some way of indicating that so I can convert them to tags
  • A list of comments, each of which has all the attributes of a post except for comments. (I am not aware of an ability to preserve threading, so just in linear order, I guess)
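As a sketch of what assembling one record with those fields could look like in code (the field names and the `make_post` helper are illustrative, not a fixed schema; the timestamp format matches the minutes-precision ISO style used in the example below):

```python
import json
from datetime import datetime

# Hypothetical helper: build one post record with the attributes listed
# above, and fail early if the timestamp doesn't parse.

def make_post(post_id, title, date, author, text, tags=(), comments=()):
    post = {
        "id": post_id,               # unique ID, e.g. the post URL
        "title": title,
        "date": date,                # ISO timestamp, e.g. "2015-01-12T21:25Z"
        "author": author,            # whatever author ID we decided on
        "text": text,                # markdown/simple HTML or bbcode
        "tags": list(tags),          # sub-categories mapped to tags, if any
        "comments": list(comments),  # linear order; threading not preserved
    }
    # Sanity-check the timestamp (minutes precision, literal Z suffix).
    datetime.strptime(post["date"], "%Y-%m-%dT%H:%MZ")
    return post

record = make_post("post-content-2205295", "External drivers",
                   "2015-01-12T21:25Z", "harry11733", "<p>...</p>")
print(json.dumps(record, indent=2))
```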

Yep, I did the Wikidot export this afternoon, and it gives me the wiki pages but zero forum data.

About crawling the forums: I can do the script, but because of how Wikidot sets things up, I don’t have access to any user data, so for example I won’t have user emails, etc. Pretty much, I’m going to write a crawler that gets anything that’s public, nothing more.

I think I can get all the information you are requesting for the posts, though. I don’t know when I’ll have time to work on the crawler code, but you’ll know when I do :slight_smile:

If you can show me the exact JSON structure for the export, I can just follow that; I think that’s what would be the least work for you. Just give me example JSON data (from another export/import, for example?). wolf.arthur@gmail.com

Cheers, and thanks again.

Well, that’s… awesome. :grimacing:

Moved the rest of the conversation to email.

If someone here other than @Arthur_Wolf is interested in doing the work to get me data sooner, PM me and I’ll help you know what to do. I have it all written down in a long email. Short version: if you know, or want to learn, Selenium, it’s probably a few hours’ work.

Took a look at Selenium… way above my pay grade… lol. But if there is something simple I can do, let me know.

For anyone, @Arthur_Wolf or otherwise, who is interested in helping, here are some resources for scraping JavaScript-driven websites like Wikidot:

The Google+ importer I wrote is built around a core assumption of Google users, which isn’t a facility Wikidot provides, so it would be a false economy to try to re-use it. It will actually be easier for me to start with something simple than to work around the data not being in the same format.

Something like this would be easy to import:

{
  "users": {
    "arthurwolf": "Arthur_Wolf"
  },
  "posts": [
    {
      "id": "post-content-2205295",
      "url": "http://forum.smoothieware.org/forum/t-1081758/external-drivers",
      "title": "External drivers",
      "date": "2015-01-12T21:25Z",
      "author": "harry11733",
      "text": "<p>I plan to use some 570 oz stepper motors on a vertical mill CNC conversion. I am trying to decide what stepper motor drivers to use. It seems that the DQ860MA Stepper Motor Driver is the only one that has been reported as working with the smoothieboard.</p>
<p>I am interested in trying the digital steppers from Automation Technology, specifically the KL-5056D driver, which people have been happy with for this purpose. Do these drivers have any advantages or disadvantages relative to the DQ860MA? Will they even work with the smoothieboard? I don't really understand how these newer digital drivers work, for all I know it may be the same basic technology as the DQ860MA.</p>
<p>I am happy to pay a little more for the Automation Technology products in the anticipation of better customer support.</p>
<p>The current technology that people use for mill CNC conversions seems a bit nutty to me. They use the ethernet smooth stepper board to convert an ethernet signal to a parallel port in order to transmit the g-code from Mach 3/4 to the drivers. This seems circuitous compared to using a smoothieboard just to translate a g-code file, and much more expensive, but maybe I am missing some advantage that this other system offers.</p>",
      "comments": [
        {
          "id": "post-content-2205401",
          "date": "2015-01-13T00:35Z",
          "author": "bouni",
          "text": "<p>Hi,</p>
<p>I've tried the DQ860MA external steppers with the smoothieboard and they work without any problems so far.</p>
<p><strong>vimeo.com / 115509540</strong></p>
<p>In my opinion the KL-5056D should work as well.<br>
In the <strong>kelinginc.net / KL-5056D.pdf</strong> , page 4 figure 3 you can see how you have to wire the drivers to the ST,DIR,EN and GND pins of the smothieboard. The internal resistors are 270Ohm and calculated for 5VDC signals, as far as i know the smoothieboard outputs only 3.3V, but for my DQ860MA that was not a problem, the optocouplers get only 8mA in this case but it seems to be enough to switch them.</p>
<p>Bouni</p>"
        },
        {
          "id": "post-content-2205875",
          "date": "2015-01-13T10:21Z",
          "author": "arthurwolf",
          "text": "<blockquote>
<p>I plan to use some 570 oz stepper motors on a vertical mill CNC conversion. I am trying to decide what stepper motor drivers to use. It seems that the DQ860MA Stepper Motor Driver is the only one that has been reported as working with the smoothieboard.</p>
</blockquote>
<p>Pretty much any external driver with a step/direction interface will work with Smoothieboard.</p>
<p>In some cases the driver will want 5V input, and Smoothieboard outputs 3.3V, but generally the drivers are fine with 3.3V even if rated at 5V. If 3.3V is not sufficent it's trivial to use a level shifter to bump the 3.3V up to 5V.</p>
<p>So generally, the vast majority of external drivers work out of the box with Smoothieboard.</p>
<p>I personally use the CW5045&nbsp;and am pretty happy with it.</p>
<blockquote>
<p>I am interested in trying the digital steppers from Automation Technology, specifically the KL-5056D driver, which people have been happy with for this purpose. Do these drivers have any advantages or disadvantages relative to the DQ860MA? Will they even work with the smoothieboard?</p>
</blockquote>
<p>Yes.<br>
All of those drivers are very similar, and work with Smoothie.</p>
<p>Pretty much, if you read DIR+ DIR- EN+ EN- PUL+ PUL- on it, you know it'll work with Smoothieboard.</p>
<blockquote>
<p>The current technology that people use for mill CNC conversions seems a bit nutty to me. They use the ethernet smooth stepper board to convert an ethernet signal to a parallel port in order to transmit the g-code from Mach 3/4 to the drivers.</p>
</blockquote>
<p>Yeah that's just a relic of the 80s :)</p>
<p>Smoothie is the modern way to do it :)</p>"
        }
      ]
    }
  ]
}

The user mapping to a string means to attach those posts to an existing makerforums user. You can fill those in where you know the mapping; otherwise I’ll just create a new user whenever I need to. Those won’t give people magic edit rights like imported G+ posts do, but it’s a support tool for @Arthur_Wolf’s forum, and it’s the best I can do. I think that referencing the original URL and author, inasmuch as we have that information, would comply with the license terms posted there; at least, that’s my intent.

The URL I put in the example shows the source page that I used to create the example. The ID I put in there is from what they put on the div, and Discourse imports really like unique IDs from the source system.
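For anyone writing the crawler, a minimal stdlib sketch of pulling those div IDs out of a page might look like this. The `post-content-` prefix is taken from the example above, and the sample HTML here is made up for illustration; a real run would feed in the fetched page source instead.

```python
from html.parser import HTMLParser

# Sketch: collect the per-post div ids (e.g. "post-content-2205295"),
# since Discourse imports want a stable unique ID per source post.

class PostIdScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        div_id = dict(attrs).get("id", "")
        if tag == "div" and div_id.startswith("post-content-"):
            self.ids.append(div_id)

# Made-up sample standing in for a fetched forum page.
sample = '<div id="post-content-2205295"><p>hi</p></div>'
scraper = PostIdScraper()
scraper.feed(sample)
print(scraper.ids)  # ['post-content-2205295']
```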

This doesn’t actually look hard to do with Selenium. I think I’m 80% of the way there after about 30 minutes of poking around.

Show off :slight_smile:

I should have been clear: that was only the scraper, not the importer, and of course the last 20% of the scraper will take the other 80% of the time! Like my script breaking a long way in because of how Wikidot handles deleted users, and having to start over.

I discovered html2markdown, so the posts will look better here.

I don’t see a trivial way to preserve threading, but since Discourse hides the threading for the most part, I’m not going to worry about that.


@Arthur_Wolf, do you want to try to map at least some well-known users from the old forum here? I can do that slightly more easily before doing the import than afterward.

Unless something unexpected happens, I expect to complete the import today. You will of course own the arthurwolf posts, and if I happen to see any users who obviously map to users here I’ll map them. Here’s where I’m tracking the user maps:

That’s all I could confidently map. Everyone else will have abandoned made-up names ending with _SW. Sorry.


I finished a complete run of the scraper, and it looks sane at first glance:

$ ls -lh smoothie.json 
... 5.6M Apr  6 09:42 smoothie.json
$ jq '.posts | length' smoothie.json 
1256
$ jq '.users | length' smoothie.json 
970

On to the importer!

It’s been a mostly-chores Saturday, so most of the time was away from keyboard, but I am now at the point of having successful imports happening, and I have just been tweaking the display a bit. 1255 topics, with 4212 comments beyond the initial topic posts. All users noted above are mapped. The most time-consuming part of this project was visually scanning 928 users for usernames that were sufficiently meaningful to even check for users here.

This update is now underway. Users not mapped to existing users are created with a _SW suffix to indicate where they came from. They aren’t connected to any authentication.
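That mapping rule could be sketched like this; `KNOWN_USERS` and `target_username` are my own stand-ins for the real hand-built mapping table and import logic, with the one entry taken from the example earlier in the thread.

```python
# Stand-in for the hand-built table of Wikidot -> makerforums usernames.
KNOWN_USERS = {"arthurwolf": "Arthur_Wolf"}

def target_username(wikidot_name):
    """Return the forum username a scraped Wikidot user should become:
    a mapped existing account, or a new _SW-suffixed placeholder."""
    if wikidot_name in KNOWN_USERS:
        return KNOWN_USERS[wikidot_name]
    return wikidot_name + "_SW"  # new account, no authentication attached

print(target_username("arthurwolf"))  # Arthur_Wolf
print(target_username("harry11733"))  # harry11733_SW
```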

Allow me to present all the imported posts from the Wikidot forum as part of the Smoothie category.

@Arthur_Wolf, you now have a single forum containing all the Google+ and Wikidot conversations about Smoothie.

You are set to change everything to make this the official smoothie forum! :tada:


Apparently I wound up with two accounts here?

Possible to

Yup! That’s the default for any import unless there’s actually a key to join them together. The duplicate wasn’t from the Smoothie import, though; those duplicates have “SW” in the name.