Journal by cafebabe

(Numerous websites are mentioned which I either love or hate. Disliked websites are monitored so that I am aware of subjects which may adversely affect friendly parties. Irrespective of attention/power dynamics, continuous exposure to conflicting opinion is essential. Separately, a minority of websites are monitored and collated on behalf of friends. It is for this reason that I follow an unusual mix of technology, futurology, paranormal, religion, minority rights, finance, design and loony fringe sources.)

Introduction

I'm often off-line for a week or more. I also pay a significant amount for my brief Internet access. Therefore, I have developed a set of scripts to grab and collate a high volume of data whenever possible. By popular request (one friend asked), I present the tidied scripts. Due to the ability to scrape, collate and speak a large volume of text (millions of words per month), these scripts may be of interest to busy people, news-hounds, people with disabilities and/or people caught in a pandemic.

During my brief sessions at an Internet café, I work through a written paper list of searches. These are mostly pop culture queries which occur to me when I'm off-line but may also be postal addresses or unresolved technical details. I also work through email or send acknowledgements that a more detailed response may be forthcoming.

Meanwhile, in the background, I typically fetch more than 2000 URLs (about 5 million words) along with JPEGs, PDFs and video. HTML may exceed 250MB and video may exceed 10GB per hour. This process mostly revolves around wget and a Firefox derivative but also includes the occasional use of OpenOffice or suchlike. Bulk data is primarily:-

  • Plain text format of one URL per line, suitable for wget.
  • HTML bookmark format, suitable for Firefox or similar. This consists of one named hyperlink per line within a bullet list (an example line is shown after this list).
  • JavaScript-paginated webpages (Twitter, YouTube) may be temporarily pasted into OpenOffice. Thankfully, .odt format is a PKZip archive and content.xml contains data which is analogous to HTML bookmark format.
  • Access logs which allow quality control.
  • Uncompressed directory structure mirrored by wget.
  • Date-stamped archives with partial text index.
  • Distilled text suitable for espeak.
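
Per the description above, each bookmark entry amounts to one named hyperlink per line, for example (URL and title are placeholders):

<li><a href="https://www.example-news.tld/article">Example article title</a></li>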

The workflow mostly consists of scripts which compress data, summarize data or convert data from one format to another. At a minimum, scripting is required to convert to and from the formats used by wget and Firefox derivatives.

Fetching When On-Line

I usually have a "generator" list of news sources and one or more categories of bulk data to fetch as time (and storage) allows. If the list is particularly long then the Unix command split -l allows fragments to be fetched in parallel. This generally keeps sessions under one hour. An example list includes:-
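
(The URLs below are placeholders for illustration rather than my actual sources.)

https://www.example-news.tld/
https://www.example-tech.tld/business/index.html
https://www.example-culture.tld/reviews/

split -l 100 news.list news-part- splits such a list into fragments of 100 URLs named news-part-aa, news-part-ab and so on; the fragment size is arbitrary.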

With optional filtering, this is effectively run through wget -x -i - with extra parameters to ensure fetching does not get snagged for too long on a dud URL. These index pages can be reviewed off-line (when I'm not paying per hour at an Internet café) and interesting URLs may be collated into another list. I'll explain the collation process after on-line tasks are completed. For now, assume that URLs of interest have been collated into fragments of tiered lists. Repeat the fetching process for each fragment, as desired.
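
As a sketch of fetching one fragment (the exact parameters are a matter of taste and the url-fetch scripts mentioned below are the authoritative versions), timeout and retry options stop a dud URL from stalling the session:

wget -x -i news-part-aa --timeout=30 --tries=2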

In a 20-seat Internet café, I often consume more than half of the available downstream bandwidth. Therefore, I have url-fetch-fast.sh and url-fetch-slow.sh. The latter keeps video transfer and similar under 1/3 of the estimated bandwidth. You may want to adjust this script for your circumstances.
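
wget's --limit-rate option is the obvious way to cap transfer speed. A minimal sketch, assuming roughly 10Mbit/s of shared downstream so that one third is about 400KB/s (video.list is a placeholder):

wget -x -i video.list --limit-rate=400k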

Archiving

At this point, I typically have more than 250MB of HTML sitting in a RAM disk of a live boot at an Internet café. And I often have less than 250MB of persistent storage. Thankfully, news websites are often template driven and it is usually possible to obtain 10:1 compression. It is often possible to obtain 20:1 compression on the Daily Mail website and this is especially true if the URLs are all scraped within one hour because the infamous "sidebar of shame" will be relatively unchanged. Indeed, the major exceptions are the sections for finance, science and travel. Given the depth-first traversal of archiving and compression, these tend to be grouped where the compressor is most able to strip redundancy. Actually, it is quite disconcerting to fetch 100MB from the Daily Mail website and take home a 3MB archive.

I run du -a * | sort -n and compress any directory above 1MB which is not likely to be a single PDF. (There's no point putting a lone turd in a box.) compress-dir.pl accepts any number of directory handles and converts www.foo.tld into www_foo_tldYYYYMMDD.tar.xz where YYYYMMDD is the modification date of the directory. The script also produces a compressed SHA512 manifest and a compressed partial text index which often avoids the need to extract an archive. The partial text index is not case sensitive and does not permit phrase search. However, it is often sufficient to find technical terms and formal names. Indeed, the partial index has words in alphabetical order so that a compressor effectively stems the search terms. Overall, the manifest and search index typically increase storage overhead by less than 10%. The archiving script is intended to be run daily or less frequently. If you want more frequent archiving then you may want to change the time-stamp format within the script.
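
A rough shell equivalent of the idea (directory name and date are placeholders; compress-dir.pl handles the renaming, date stamps and edge cases):

tar -cf - www.foo.tld | xz -9 > www_foo_tld20200315.tar.xz
find www.foo.tld -type f -exec sha512sum {} + | xz > www_foo_tld20200315.sha512.xz
find www.foo.tld -name '*.htm*' -exec cat {} + | tr -cs 'A-Za-z0-9' '\n' | tr 'A-Z' 'a-z' | sort -u | xz > www_foo_tld20200315.index.xz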

I run diff -r -q between $HOME/Downloads and persistent storage. It often gives false positives where persistent storage is FAT32 or similar due to transliteration of file names. Then I forget to persist bookmarks which I accumulate during my session. (Doh!) Then I take the data home and extract it.

Bookmark Management

At home, I live-boot, run a web browser, disable JavaScript (because all local HTML has read/write access to the filing system), import bookmarks, unpack today's archives back into $HOME/Downloads, categorize stragglers into directories and then run:-

perl /path/to/convert-dir-to-bookmark.pl /path/to/uncompressed/stragglers $HOME/Downloads > /tmp/foo.html

This runs a breadth-first, alphabetical enumeration over new data and the result can also be imported into Firefox. It lists files in the first level, second level, third level and suchlike directories separately. This has the advantage of collating (most of) the news indexes sequentially while many of the disadvantages are mitigated with browser bookmark search. Unfortunately, I sometimes forget to categorize URLs such as London's Evening Standard theater reviews. It doesn't help that said URL occasionally drops off their web server configuration and is therefore occasionally not present in the saved data.

./convert-dir-to-bookmark.pl makes a token attempt to retrieve a title from HTML. However, if it has the placeholder title 'foo' then a <title> tag was probably obscured by yards of JavaScript and/or CSS. This should be regarded as a warning about a website's priorities.

From here, it is trivial to categorize bookmarks by priority and topic. From Firefox's bookmark management window, meta keys can be used to select ranges of mostly sequential bookmarks and open them as browser tabs. For example, it is relatively trivial to open all financial news index pages in one window and fashion news in another. Almost all of the web pages will be displayed without adverts, CSS or JavaScript - which is utter bliss. Unfortunately, without style, elements may be spread much further apart and this may incur significantly more scrolling. This can be partially overcome by tabbing through hyperlinks.

Further use of meta keys allows all interesting hyperlinks to be spawned in additional browser tabs. Obviously, these are all dangling references and all of the browser tabs will contain error messages about network connectivity or suchlike. Regardless, all of the tabs can be bookmarked with a menu item such as "Bookmark all tabs..." and named as required. This process creates a tab explosion and some forks of Firefox may have a hard limit of 1000 tabs per window. Even without this difficulty, I find it useful to occasionally checkpoint and persist bookmarks to storage.

Be aware that Firefox derivatives may silently fail to save bookmarks if there is insufficient space. Combined with the risks of truncation and corruption, this makes bookmark state critically important to the workflow. This leads to numerous versions of the bookmark file being saved. It is strongly recommended that bookmark versions are named in globbing order. This simplifies periodic use of diff-bookmark-count.pl when deleting uninteresting versions. This script treats unannotated URLs as an unordered set. In particular, URLs which have been annotated, categorized and possibly duplicated are regarded as "same".

As a top priority, I find it useful to scour index pages. Firstly, headlines alone keep me abreast of developments. (I'm often a week or three behind published news.) Secondly, if I'm unexpectedly near a friendly Internet café, I'll have a partially collated list of URLs to fetch and this may reduce the usual latency. The list of URLs can be collated by opening the current version of bookmarks.html in a text editor, searching for the first instance of >fetch< or similar and then pasting the remainder as the standard input of:-

perl ./convert-bookmark-to-list.pl | ./dedup-url.pl /path/to/previous/*/fetch*.list - > fetch.list

This list will require extensive tidying. At a minimum, HTTP GET parameters should be reviewed. Obvious cases of privacy violation, such as ?fbclid=, are automatically removed but this will fail if the parameter is merged with others. This edge case has not been handled. Likewise for an endless list of unknown parameters. It is for this reason that you should always check. Further checking, sifting, sorting and de-duplication is likely. Regardless, the bulk of the process is automated and it allows unique references to be followed with relative impunity.
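
As an illustration of this kind of clean-up (the file name is a placeholder and, like the shipped scripts, this simple substitution fails when fbclid is merged with other parameters):

sed -E 's/[?&]fbclid=[^&]*//' fetch.list > fetch-clean.list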

  • ./convert-bookmark-to-list.pl contains multiple messy heuristics which may fail. A large number of clickbait websites use absolute URLs so that copypasta hyperlinks resolve back to the correct website. Many of the remainder use relative URLs with identifiable patterns. For example, Wikipedia URLs begin with /w/ or /wiki/. In Firefox's bookmarks, this will be held erroneously as file:///wiki/ or similar. Unfortunately, Encyclopedia Dramatica uses the same format, so you may want to modify the script according to preferences. Invariably, there is a long-tail of unknowns which have not been promoted to absolute references. If this is not the case then assume truncation. Anyhow, common cases which annoy me are handled. This includes off-line reading of Reddit via re-direction to old.reddit.com with the suffix ?limit=500. If you have example URLs then I may be able to include more cases. However, this may be delayed.

    ./convert-bookmark-to-list.pl has miscellaneous functions. -b converts to bookmark format and preserves ordering. This can be used to convert OpenOffice XML to bookmark format. This allows YouTube video indexes and other JavaScript pagination to be pasted into OpenOffice (while on-line) and then the titles can be loaded as Firefox bookmarks (while off-line) for further consideration. -o and -u create de-duplicated lists which are sorted by URL. The HTML output may use tags for ordered or unordered lists. Such lists can be collated at leisure (while off-line) and inspected manually (while on-line). Overall, this is useful for awkward cases, such as designer websites, websites with pagination or other scripts, or websites which require agreement to terms.

  • ./dedup-url.pl avoids retrieval loops, such as traversal around Wikipedia's non-hierarchical category system. File references can be specified in chronological order using globbing. Novel URLs contained within the final file are sent to standard output. The script contains an exception such that all URLs from SoylentNews and Reddit are regarded as novel. This allows interesting discussions to be followed after the first retrieval. With judicious use of text de-duplication, a delta of all discussions can be collated.
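
As a rough illustration of the underlying idea (file names are placeholders and this omits the SoylentNews/Reddit exception), novel URLs can be isolated with standard tools:

sort -u /path/to/previous/*/fetch*.list > seen.list
grep -v -x -F -f seen.list fetch.list > novel.list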

I may work on scripts for quality control (URLs removed from bookmarks but not fetched, wget failures). However, this is currently an infrequent ad hoc process which has not been automated.

Collation And Rendering

So far, I've described a never-ending cycle of fetching URLs. While it is possible to read all data on screen and select more URLs for download, for example, within the body of Wikipedia articles, this process can exhaust all available time. A more productive approach is to distill text and render it as speech. This process may *also* exhaust all available time. However, it allows more information to be absorbed while performing other activities.

An overview of the collation process is:-

cat /path/to/current/bookmarks.html | fgrep YYYYMMDD | egrep -v '_(dailymail|reddit|wikipedia)_' | perl ./convert-html-to-text.pl -t -s -c | perl ./dedup-lines.pl -c -n | perl ./dedup-words.pl -c -n > /tmp/speak.txt

YYYYMMDD is assumed to be consistent with the name of the directory on persistent storage which holds files which were not compressed. It is *also* assumed to be consistent with extracted archives under $HOME/Downloads. If you do not live-boot or compress a subset of data then this split storage complication may not arise. Regardless, you may wish to hold data in chronological bins and only rummage through the most recent ones.

  • ./convert-html-to-text.pl accepts a list of file handles from standard input *or* command line but not both. File handles may be in plain text format, HTML bookmark format or OpenOffice XML format (not the compressed archive). The function of this script is to concatenate files and distill them to plain text. The original documents are separated with ++++. -s suppresses unwanted content. -t trims low-priority fluff like Wikipedia's ubiquitous '[citation needed]'. This alone greatly improves the flow of text-to-speech. It also eliminates forum meta-data from SoylentNews and Reddit - although the result is a sea of humanity where posts become one block of text. (If you are alarmed by this then you will be displeased to discover subsequent steps to increase throughput.) -c skips WordPress comments and is likely to be expanded to skip comments of other systems.

  • ./dedup-lines.pl builds a hash of string keys and only outputs unique lines (and the webpage separator which is now stripped of its HTML paragraph tag). A minimal sketch of the idea appears after this list. This process often requires more than 1GB RAM and 5 minutes to run. On 32 bit computers, it is likely to exhaust resources. This can be reduced slightly by setting -c for case insensitive matching and -n for numerically insensitive matching. I considered other methods to compact the data but I am concerned that they may run slower. The purpose of the script is to eliminate obvious duplication, such as website navigation. In the general case, only the first instance passes through and this can be manually trimmed before running text-to-speech. If a website suggests related articles then the number of novel suggestions is likely to decrease over a few pages. Unfortunately, this doesn't apply to spammy websites, such as the libelous PinkNews or plagiarist TheRegister (which look surprisingly similar at this stage of processing).

  • ./dedup-words.pl may be far too aggressive and can be skipped. It splits words from surrounding punctuation and eliminates 4-grams using a window and a 4 tier hash structure. -c makes the 4-grams case insensitive and -n makes the 4-grams numerically insensitive. Overall, this process uses surprisingly little RAM or processing time. Furthermore, it has no effect on dense information but trims empty phrases commonly found in journalism. It removes duplicate articles, typically ones which have been syndicated around the "Intellectual Dark Web". Most often, unremarkable sentences with the same ending are truncated. From investigation, trimmed text very rarely contained anything of critical importance.
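
The core of line de-duplication is essentially the classic awk idiom below; dedup-lines.pl adds the -c/-n switches and handles the webpage separator (file names are placeholders):

awk '!seen[$0]++' /tmp/distilled.txt > /tmp/unique.txt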

I may add further steps to reduce data and this may include grammatical analysis, word frequency analysis or text summaries to remove low-value sentences. At present, the output of these scripts may exceed 3 million words. Regardless, it can be read with the following scripts:-

  • ./speak-slow.pl reads at espeak's default rate of 150 words per minute (9000 words per hour) but with the pitch raised for use with headphones which have 32mm speakers.

  • ./speak-medi.pl reads at 300 words per minute (18000 words per hour). Ideal for fiction and technical documents.

  • ./speak-fast.pl reads at 360 words per minute (21600 words per hour) and eliminates common glue words for additional throughput. Ideal for bulk data, such as financial reports, pop culture and speculative research. This script has been set as high as 480 words per minute (28800 words per hour) but the ability to absorb information is limited without sustained use.
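
These scripts presumably amount to thin wrappers around espeak's rate and pitch options. A minimal sketch with illustrative values (300 words per minute, raised pitch, text read from a placeholder file):

espeak -s 300 -p 60 -f /tmp/speak.txt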

I may be inclined to make SIGUSR1 play/pause and SIGUSR2 read lines forwards/backwards through text.

espeak is able to write .wav to file or standard output. ffmpeg may read from standard input or concatenate .wav files. Therefore, ffmpeg can convert espeak's output to .aac, .ogg or .mp3, which makes it possible to produce text-to-speech audio files which can be played, paused or skipped while away from your main computer. Keep a pen and paper handy so that you can follow references!
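
A minimal sketch of that pipeline, assuming an ffmpeg build with MP3 support and placeholder file names:

espeak -s 300 --stdout -f /tmp/speak.txt | ffmpeg -f wav -i - /tmp/speak.mp3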

begin 644 script20200315.tar.gz
M'XL(`"O0;5X"`^T\^7?;.,[].7\%ZSJQ%%N6Y'-:7\UNVVW?F\[,:]*=_=9R
MLK)%VYKH\.B(<\C]VQ<D=?M(>GSMM&OU-99($`1!$`!!2.[$T1=>3:I)4EUN
MBA/;7#C8=05-=ZH+X]$7N22X6HT&^97;C7;ZEUR-1DM^)->;=:E1;\BUUB-)
M;C5:[4=(>O05+M_U5`>A1YZCNW_Z>.N0[ZO_3J\GCT7?=<2Q;HD+[!A(^/W@
M8&H[6)W,41$;6$+<\Y-W__@GC^X.$"OI?7!%12P71=%V.U"H3SE!8U4,*@5W
MSE6/>444Q1F#A:J%ZLU[19D]05ONL8:GNH4UCE;Q?!"P.X3_1*42'^&,FI:J
M)=9V%7<E]VB',4KZ]/B#.)2$IZ.[GU9%H#2%AK8`XKB+,B]>%&5Q%M*6JE.J
MXD5"-$+/04RD'OSQ&''5DEBJLC''()YN2KV9"3^8*Q+XX=,1G\%<[;D+1[>\
M*5<XE!KZH52C_PN5(FD[;([*\E-)"I\:\!3>UB,\J\P0D849WF1P[HWK89,K
M3#3&K@XRK\*)I)"%F"""9)`T1`A?ZQXIB@!6J1[7T<*4L4F7D>#=+#":(H&.
M34(!NE:=F8L$";ESM2G77-^$0M=V/"1<HAKJA]R`E5<U54N?8M>+",N1E25J
M&S'7MY^'T5Y@BWOSBU1YR`"CH011![3UK^_/H'EZ9$O;T2*0Y5PW,!<+^ERU
M-`/WNM!E/RW@83G(WU"Q%&<DID23#(/5@V0K56ZF3X,_%G@P"Q86_->F@;EH
M\$513\LZX-3L2:]4ZL0%X5CE2J$;=9@(!4*P^+FB(?6D#OSMUFODIUQ.8Z2D
MQ$,QX(<,1$X/).FZVJ,0G53%ZF#]C@R.$/I![+[&JJ:,A^?]T7%?G.5&L\;)
MS^D^9`YPN_M8$*K'`T'HB^E%GP%Q%ZJ%)H;JNKV"ZCCV4G!-U3`$I]"'IEU%
M)`"DO?X`!`M_.O7PM??0MMPIM=/!J7=C8#YB#VNKR+M:*N*`^UOP)C@-W@<O
M34#@V"`NI]!?C&9'ZQ``;>/*$"D>E=3R#ABE6GG641XK`X53>.5.68VV`Y^C
MR*SD:U!Q6\W1$XYI>KG27/&=@3B9@PS+?&KQ).">(PY/A'^/Q*$JW([$-,#A
M7'7G/8Y/EZ5M80UQ[L(`[0'D5PBV=9$C&.XH[*HG;Y,ZID6(PD"%<`4^*U3^
ML'6+*Z%2A2J72WSC,H+X2D&Q"IT-BV9BV"XF2WFCRHYJ(P/%GDFO8<$]:C2M
MOCY)*1._:C)-,&8-T&?K^<]"Y)A(<*8?2='J8'5P\(/Y?V[>_[>NL.,)8]N^
M-%7G4O!LP=!=[[/V`KO]?UF6VW7P_UNU1E-J0,$C(*9=J^_]_V_E_[O^&$UL
MPU`]?*%+=$&8-[%SXL[UJ=<)RSS=2XJ@[`ER@5$+]/KL[<_5@XQ'4SU6QJ_?
MO7Q%3,;HN,=^"MSPO-`?E?E"]5@$3SRT0_<U*T&S$FE6^JAFT`J15ML:=5]3
M.U*.#./PO`L_8$+#\I2E3+MJL0WLQO8T!&+LV0&38IEJ3>:VDV.:\F1X_G@$
M](IKW2I/HBU8"HGGJ)-+W9KET0RXZ7ABZ%JPF#JV&?B>>3%1S86JSRSZX-J^
M,\%\+^J)HISJUPB#IV-AVW>184]4`SEXBAUL3;!+3".(B6G:%EKBL:M[V,UU
M>SX%7^T9>"CB!OK/.45<<OJES@\4D1?GGK=P*2RVJDLH7F!-5ZNV,X.I6A_\
MV%02*(6`02=K!1L[M?#25<*]8:.R4@1V6TO?B<-S11P=@SN=T+5<+JNWV+'G
M6)OA*HR<$48893NFZB$?^&^C4[Q`-4E^6EWOF:/LTG1/QVXP<6X6GAU@4+FV
MJ4_<`%O8F=T$,VPO;/`R=&!W,,>JX<T#HHJQYP:P1%W;4@T!O%^0%QPDD!Z>
MS"W;L&<W_*=13P0#_3KQMA*O0L=@2`5']4#"W`!H()T"P4`X4#"SP718)K:\
MF%["ZL`%L?8$%QL&4$_I#$!,-<`AZ!I671Z0$V=,$6!I#J*9H9,@R2/Z%!7^
M5-D^-6-LW0).-1G;!GE3O<E<&5Q!OW:/H%2%*;B-K<H*G.$</MV:VDO5<7?A
M&SK^*"N\MJ%5':S!'&]K1Z`XZ$"I\@,&J1!0D-;U]IM$F`O!E&JF-7!E`%RA
MJDT9&+JI>[VF)&VFF_!](R/GJC$=JY?8N=D^;%"M(`P.B)V!`R+1,.,NKX"+
MD&:$:\-.Q?*(!&Q;Q(02Y<-'-R+]@XDRR=`7AC*P%[TK'5KX8^4(_NA:CVEK
MX$7QXPEB4G&K7(Q(-Y[J^;&RD$'\,MSREKKG82=DE;Z%UI!5;E:(&4:ILJH>
M\SG)(TM$TQUOE^1AZ\(?7KJC&'U,^>BNM6.57.D3O!UON!+3Z[!165N(]8<O
MRBO[^M[>+NC"A\T;6?*Q8-$URD<JN1WW45:J<\\T0&D,@'.#?(>:JALW)OR!
M;JO^Y3;V<:'<@M*%"?14G>FM2*T:^A0K@DLVVP$7Z29%X`>10EN.]=O`78!B
M([KL"AL9)78<L2:9B?(::_`UC;3OHA.D1U`3!K"!Y_&`C%J:ZFCW#'CJ.]X<
M.TPEDS]+W2%<ACO^07.]>8*9`Z#!)MG<O:RH9)(9SB^'1K(<:FMWM!4STN0V
MKMS,"AA4B@@P;6>.[;F7-["!<:N;E_JN_IFXB6O=$#[B&2"E2S]F.O2W,-29
M[[#NJ`LU42W;(O8YM+)Y*BCB`<5\8_N>#U8,YHK0<"+\6Q%`">65`P4;8Z;T
MJ3T;7/5"[9.C8&M7U/J$B$+SH0PHKMZ7ZIMT[EOZGW<A":L><X8[884T=($K
MJL,])P_\J!?"1?5RNEXF]7%S!WN^8R$2TUQE-BOQ3F7JJ+-DGT+V];3D@]CE
MSE[^Z^P9/S@!<Y\*+8;U8%S3$,Q3AU4XX&$/L%;!BT?)+HDKRI5BG1=QM$%8
M(6RX.(\_OT.@U(%9FGES1B3?;R0AB`Q^4ELI36V[E`E)1/R(F0%SM/`]BH.,
M'/Q-W.O%)S-Q/(NQ&W&I:!.9KZ1S%J4*X5(QJ%4TN`2[G,+.%0T:.^YFYA<*
MTD%DAKO0_?E-OWN"7K_#TYY28&)1-$9*H<]$`.Z[XDF_*P+<KOYK$>8([Z\_
M)_`?.^(\5:Q92%,BSYL)2Y"("0TY:NMY:M__!:A-:$A$"LE4J!C94N>`#(&<
M!0ZE$3V;$\8E1@Y=:HB>$]+U2QN0QM'`,ZWL7:UJVUKYNUK5*:'0)A(Z>F39
M#Z5^\^%+"E$<QTW.6'KDC&5VFSTY#(].*H7;"6R82B%H*3D)BDE/(;G>AN1Z
M)Q:\WF#MO&:U^VPI?202J9*H/H,A"A7ST?P?,27"0N$D$(HDRN$M79V>O4AZ
MV]#3ZB"%,$+WHT52?ZCX+VP_2.@W"@-_7BK(/?D?M3K<D_AO6VK6V_4:B?\V
M&_O\CV\1_V4N!,P^2_>`F\2/HD_QT70Y]F'HV:WN//X@GM-X1>Q4$?CBRU_^
M>5?Z[?<7I15+F8#"2,M$")&H*/&!X)H![!*P?H7^B,?Q3>8V][#VN*%@8]&6
MPJW%.RIV5MU3"=4Q$U.!HT/QL-;,>(["-`)()1%D#_SO,6D/,&H/,VL/,6QK
MIFV7<4M..M->I73=D'+.9'323RQ>OFG&LN42#>B&(G;OV=-:]@2+X!=39[E)
M4'_=F<_6#[O]T?;*S0?G:PP*6U#'/S^Z3(1?/*Q/TYA"UR["TWUQQOS$EZ_`
M3Z2!>5&,W45T\N+%Q8N3LY=0)\'CSR>G9Q=O?WWQYM6;ER]869]10CS)K"^>
M]Q<?Z(;%NB7G@QU!!:V-O9'M#@C!D?,^4JU7>U_C4^T_":\0!X#DJ'QV'N@]
M]K]=:S>)_:_7VK+4:%+[+\N-O?W_5N>_Q86#K\B&KPB#-NF-ZR\6](;$^\A-
M/DDT3(Q1`K%24LV%<!08GM`-9I[0#ZRQNQ!0@$WV8]$?;ZZSFSDV#'TA5*O5
MP-14=RX(@17^_NG;GE`(#&"\+2BEP(EN#(W<%`(G_#54\HL")_Q=&+YK6D)9
M%*+\4:YHJ2:N%*]4P\=\+R)7$"MT`)4:U39%;'EW%'+58Z!$BVP>*AEI<!+\
M/7@9O`E^"7X-W@=J,`EPH`=68`=^U'.ZM9QIK4Y\#P<3W9D$,Q)$#GS3"%P#
MQAZ`[IX%JN,$GFYH`(,UW4A281F9A)H5RWREF:AR)\I1(3X<@.C>S9:06!R/
M.E>.XB.FJ*R3SNR-52WMDH"L8C)"K9^JRD0[PIC;N1*F9Y$0>&)=P\9AJM:6
MAM=A./(5B<AF<QM3[7UKH4XNN=+K6JD"N+8A\Q-DC0<@:Z20I:*."'Q8@K,*
M=W',C>08,H^9YB,F?*:/6_+DXLIS!,YTI@21H_U,":<<*4\&87!V5.X,>/&(
M33'-=(O#CM&<L*S'D+RI;6B,/*!<SHI!+?^XA=H(,<%`.5!+A5]-X)FWK0>9
M^?B;$,4HW`5>LO;9+).#-0<VE_=ZJR\"6QLN7&_$<U![&PQGUZ-;GIP,Z79N
M?B4VF2%2HN5RH3C%*L.5"H2M9>264VLC(8M$WV*FL6P3=LL-2X71@.>HISC@
M%3E%5-2\6-O06Y(Z00^&13OL\_&ZRY]J=5B3Z(QUUBO:8FEC!3B-`Y+5$N\I
M'F_:5##^@*?U\MT[5"A*S]!2=2Q04L]0N$^S;`\1FA._OYI*6TQ/0.(V[MR;
MW+,ON7]/<M]^).-J;]N'L'UJO*/:DGV<1-IR2<>D>6H/<K`EY3B6*&)PCXZX
MU*CBG):A4KT8V<Z,Z`!Z.D.D@KXLP?8SOT>`R6;&YC-T,3TR#(@J>2IW>"[<
MP/#'H8(.H+D'!LE3/=VVD(6QAK4TF#)BC>L='J@V(W&*AC2O]?]N6Q[)!P`B
MX(DD*,,/K4F]`I*"'YZSA=)_%R<654G&%7%`DYW5*LN>A#LL!8*R!K10EC7Y
MH7<YDF4=D-<S=F=0QPU4-'?PM%?X0[U2F:/\;'A>&!T7THW55-N(3N(N9>A,
M)1]LGL<\L2>WOJDJY=?JK7^IDXX,'>E:;[VK-99\=%==3M.O`L(;/DI/']_0
MQ'0$0L"=^E;PUK:",Q^[P>]8L^#G;.X[;O#*T8-3%18VKZDWB&15T]P!1+I#
MHW(%/6=B]8S]#$]^&[W-IKNO2=`Z*9Q-CEOY@88]53?<0F;*[F%%DLWR,`FA
M1_TC/C%_88%8E%&QMB[`K)I##.JN7EG1,T%J#;-F.2*1N,]K_(<QD[DM1+DT
MA>S0XMY(+O_PY&]O?CM]/XH/(<&T=,&U$H^(D:>GD+54O\G+!U1\LV+/RM))
MC3$TR7S(`9.B3;#DO)KZ)'P_L7M=18[?+E"L]4;<ZZ$LM$;!;_Q.L`T#WD!!
M_B6%F&$DM>:(J.)P,HBX'<0GFP!%H>EN1^[DPQA,U:^%,HB`A0=3U*T4Q%",
MUDZLO%(\TW0?)6?,5@;434#I3FL'Z"0!I7LQ><V4K=MJDL3I/$.^=6G92PLM
M5,?$'G;`)(:H4[::ADQBCR1SVO;PV`[><KYV1/P\6IT)[FP*[3#3FHWML-:D
MAD9W]E&=+Q[_`7OO+P3"8?>+O?Y[7_Q'EIHL_[_>:+;;;8F\_UN7I'W\YYO%
M?RS?Q"S:H[KTYO.5H96H+8K]@1J.]/]U-=P&-23E]!`ME)F3+T6>.QE53#<#
M()M&ZOF(4MY5(0/+0V]X,RT"9U00#I?9OW`SFM[NA%8M)BIB;O)ZM8NQ=<=Z
M2X(X^6;$!*7@>DE4::\L_U?TO^\87T[[/T#_U]GY?[U=:[?E&NC_9KO5VNO_
M;_G^EX-5%C(\IB_%)@&]_%M@WTD,Y_[`#7MUEU3$2B]W?KQA>Y`UB&$>Y!%A
M'J<<SJ5*V@FF>5!1E9RKRAPR9++^YG)B4C%E$I?:XM,WG8+4FR"\F-/LF&7-
MKIN#N43/$-9,0=1@K_/_5_4_>?7[Z_G_M9;<"/W_%M0TJ/\ORWO]O_?_OX'_
M3XR?.AX[^&K'ZP099S\Z-XU]?7NKI\]`MSCZJ8.Q&4N\!GNF$?;?'QIYKEO1
MF788C5-*`WISS(NZ78F")I2F:!9UB^\^C6A[;OM>#XK6F/U\J5N:O93B#V.D
MT[&>YA/[?7<>MXA,'$,:I2V%M7*/.V)\YHHA_%`>\96UPOJFPN:FPO:(#TFD
M@^$RW]Y)#;HO)01G+"+A_%V(3@:A624/<OJAEGZHCU:9[X!\,I;T!T-"SH6U
M23GC+XRNDH&0/@:"B96\->-L6]_;\.WN1\HGK>T4D8^JEC>*0/Y#692*GIR7
M4<*A+,VI4\E(6``H)2U,[;#OM)0JM#+Y<@@\#=/-!'E$OD!&]^G93,GXK+GP
M%_BT2-[^Z]-I\O&/B>U;GY_^=>_^KP:;O6C_UR#YWW*[7=_;_[_&_F_3ES\.
M;4/K'<+V@S[`;ZAK?Y2]8!1R`R(*7)QS0[,QZ/DO7\CF"1>!!7=%.=;@J^V[
M1TYF1Q=L(YC=`V[:4H;*IT@RVI@N+9JZY8>WZZ\R=>+]9_I%G^SFDNTK@>;4
M>!,C"'.;VQ>&W9?+V=%M0@J--R*E#,HC)0/)(PV58YGVJ'@"!5*\*%&9Z,S]
MMO3_3_^["ZQ>"E/5];Y@`/`>_2_5ZU3_UZ5Z3:Z3_%^Y)<O[\Y]OIO]S^O$B
M?_QQL>DM(%:JC#DU($DV<QQX=C#&@3T-=#=8JB[YP@J*#^^C;V962@'"5.J0
MX*)Z2T+"`K4E]HYS_*$\P!Z_@1Q_QRXV.?%'V]*?;-N'KSY]_9M8T[_B^J\U
MR?IO-EJU5K-5H^>_LM3<K__O9/WO6L_2?CU_A^O?->SE5U__9/\GUVL-^OUW
MN;G?__T`ZU]N[M?_=[7^?<<0IMB;S-D>P)U_E?7?J#'_OPT[`8F>_[3W^5]?
M;?V3M3]6W3DL_!GVD/`>E=[:M[IAJ&*S*B'N7[+<03_#-OP:Z:V?6AWD7#VK
MUZL2C_Z!)Y>V6)-DDL0AHU>Z@Z?VM4@J2TBP)DBX1H*'FDA8(AD)@@.;>-L4
MEJH.W>A(0+4^8M+V'XU\OZE\^'^'YJ$F'+X^?'MX^I^J8<_V:_X;K7_J`WRE
M]=]JQNN_+5'[WZ[OO__QK=:_0#\>2C[SBGNPLB5SKQ+VU_[:7_MK?^VO_;6_
6]M?^VE_[:W_MK^_[^B_F?RJZ`'@`````
`
end

(Usual instructions for uudecode process.)
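
For example, assuming the journal text has been saved as journal.txt:

uudecode journal.txt
tar -xzf script20200315.tar.gz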

Comments

  • (Score: 2, Funny) by aristarchus (2645) on Tuesday March 17 2020, @12:48AM (#972059)

    Scraping the submission list on SoylentNews! Careful, cafebabe! You could get censored!

  • (Score: 2) by Bot (3902) on Tuesday March 17 2020, @10:45AM (#972137)

    >And I often have less than 250MB of persistent storage.

    But what prevents you from hooking up an SD card or a pendrive?

    --
    Account abandoned.
  • (Score: 3, Informative) by canopic jug (3949) on Tuesday March 17 2020, @10:56AM (#972138)

    Interesting. It's a very clear style of writing and there seem to be few dependencies. Thank you for posting the tar ball.

    Another approach would have been to use an established XHTML parser, preferably one that supports XPaths. Those all have built-in functions to extract pure text. The real gain with XPath is to be able to target specific ranges of elements. They even enable selection by elements matching specific criteria.

    --
    Money is not free speech. Elections should not be auctions.
  • (Score: 2) by hubie (1068) on Sunday March 22 2020, @01:39PM (#974123)

    This is a very informative and useful post for me. Thank you for taking the time to write it.
