WhyR Hackathon 2020
Clustering texts
2020-09-24
Egej team members:
- Jacek Kotowski
- Edna
- John Flynn
- Gaston Becerra
- Erin Hodgess
Addressed area
Challenge 4, “Revealing the content”: based on the data below, find out which key topics are described in the articles.
When building an analysis that helps to understand a corpus of texts, you can consider the following questions:
- What are the key topics described in the articles?
- Can you divide the articles into groups with meaningful characteristics and a story?
Here is a plan…
- load only a sample of the articles for now; change head(x) to load more, or remove it to load everything (error handling is not done yet, so articles that fail to load are simply skipped)
- clean them: remove digits (you can easily add other regexes)
- tokenise into words, remove stopwords, with a provision for a user-defined dictionary of further words to remove, then concatenate back into texts (I find this approach very fast for removing specific words)
- tokenise again into skipgrams: expressions of 1 to 2 words with, if I understand correctly, a distance of 0 to 1 words between them
- run LSA from text2vec to reduce dimensionality (with more documents you can slowly increase the number of dimensions to, say, 50, but on a small sample about 5 dimensions is the most that will run without error)
- generate a distance matrix between documents (also text2vec)
- now we can run a clustering algorithm and show which documents “belong” together (see the sketch after this list)
- we can improve on this a lot further, e.g. by using the standard “elbow technique” to decide how many clusters work best. I am bad at documenting and visualising, but we must look at what these clusters represent; I am thinking of using the tidylo package by Julia Silge et al. to inspect which vocabulary decides membership in specific clusters
- further, we can join the cluster values with the original file and (NOT DONE) if someone is good at EDA of networks and visualisation, we can look for dependencies between users and their relations to the topic clusters
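To make the later steps concrete, here is a minimal, untested sketch of the skipgram, LSA, distance and clustering steps. It assumes the cleaned, stopword-free df_texts tibble (columns ID and texts) built below; the pruning threshold, the number of LSA dimensions and the number of clusters are placeholder values, not tuned choices.

library(text2vec)
library(tokenizers)

set.seed(42)

# Skipgrams: 1- to 2-word expressions with 0-1 skipped words in between
tokens <- tokenize_skip_ngrams(df_texts$texts, n = 2, n_min = 1, k = 1)

it    <- itoken(tokens, ids = df_texts$ID, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 2)
dtm   <- create_dtm(it, vocab_vectorizer(vocab))

# LSA on a tf-idf weighted DTM; ~5 dimensions for a small sample,
# more (say 50) once the whole corpus is loaded
tfidf     <- TfIdf$new()
lsa       <- LSA$new(n_topics = 5)
embedding <- fit_transform(fit_transform(dtm, tfidf), lsa)

# Cosine distance matrix between documents (text2vec::sim2)
d <- as.dist(1 - as.matrix(sim2(embedding, method = "cosine", norm = "l2")))

# Hierarchical clustering, cut into five groups
clusters <- cutree(hclust(d, method = "ward.D2"), k = 5)

# "Elbow technique": total within-cluster sum of squares for k = 2..10
wss <- sapply(2:10, function(k) kmeans(embedding, centers = k, nstart = 10)$tot.withinss)
plot(2:10, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")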
# Functions to quickly get text clean.
# pacman::p_load_gh("trinker/lexicon",
#                   "trinker/textclean")
pacman::p_load(hackeRnews, tidyverse, tidytext, jsonlite,
               rvest, lexicon, textclean, hunspell)
future::plan(future::multiprocess)  # parallel backend
theme_set(theme_minimal())
Load data.
We are especially interested in the content of the articles people quote. For debugging we will initially use only the first 1,000 documents.
articles <-
  fromJSON(readLines('data/articles.json'),
           simplifyVector = TRUE,
           simplifyDataFrame = TRUE) %>%
  as_tibble() %>%
  head(1000)  # keep only the first 1,000 articles while debugging
Load articles
Then we load each article’s HTML and extract all the text contained within <p> tags.
df_texts_raw <- tibble()
for (i in 1:nrow(articles)) {
  url <- articles$url[[i]]
  tryCatch({
    df_temp <-
      tibble(ID = i,
             texts =
               read_html(url) %>%
               html_nodes("p") %>%   # all paragraph nodes
               html_text())          # their text content
    df_texts_raw <- bind_rows(df_texts_raw, df_temp)
  },
  # no real error handling yet: articles that fail to download are skipped
  error = function(e) {})
}
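As an aside, the same scrape can be written more compactly with purrr (loaded with the tidyverse). This is a sketch, not a drop-in replacement; possibly() plays the role of the empty error handler above.

# Wrap the scrape so failures return NULL instead of raising an error
safe_paragraphs <- purrr::possibly(
  function(url) read_html(url) %>% html_nodes("p") %>% html_text(),
  otherwise = NULL
)

df_texts_raw <- purrr::imap_dfr(articles$url, function(url, i) {
  paras <- safe_paragraphs(url)
  if (is.null(paras)) return(tibble())  # skip articles that could not be read
  tibble(ID = i, texts = paras)
})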
(Running the loop emits a series of harmless warnings of the form "closing unused connection 11 (https://…)" as R cleans up connections left open by read_html; the warnings are truncated here.)
rm(df_temp)
head(df_texts_raw)
(one row per extracted <p> paragraph; long rows truncated)

texts <chr>
1 Cal Paterson | Home About
2 September 2020
3 Mozilla is in an absolute state: high overheads, falling usage of Firefox, questionable sources of revenue and now making big cuts to engineering as their income falls.
4 Mozilla recently announced that they would be dismissing 250 people. That's a quarter of their workforce so there are some deep cuts to their work too. …
5 The stated reason for the cuts is falling income. Mozilla largely relies on "royalties" for funding. …
6 I'm sure the coronavirus is not a great help but I suspect the bigger problem is that Firefox's market share is now a tiny fraction of its previous size and so the royalties will be smaller too - fewer users, so fewer searches and therefore less money for Mozilla.
Next we join all paragraphs within each article into a single text per link.
df_texts <-
  df_texts_raw %>%
  group_by(ID) %>%
  summarise(texts = paste0(texts, collapse = " ")) %>%
  ungroup()
`summarise()` ungrouping output (override with `.groups` argument)
head(df_texts)
(one row per article, the concatenation of all its paragraphs; rows heavily truncated)

texts <chr>
1 Cal Paterson | Home About September 2020 Mozilla is in an absolute state: high overheads, falling usage of Firefox, questionable sources of revenue and now making big cuts to engineering as their income falls. …
2 By choosing “I agree” below, you agree that NPR’s sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites … NPR’s Terms of Use and Privacy Policy.
3 Should you join a big company or start a startup? … Sqlite is an existence proof that you don't need a thousand engineers to have an impact. …
4 Language & Linguistics, New York, USA You know how to have a polite conversation, right? … In You Talkin’ to Me? she makes a convincing case that the sounds of standard American English developed, at least in part, as a backlash against immigration and the accent of New York. …
5 If you work on web apps, you may sometimes feel that your tools have forsaken you. … Under the hood, Akita analyzes API traffic to build a model of your API from scratch. …
6 Share this with Email Facebook Messenger … She was an Indonesian domestic helper who earned S$600 (£345) a month working for an extremely wealthy Singaporean family. …
Frantically googling for cleaning functions, we discovered the textclean package.
df_texts <-
  df_texts %>%
  mutate(texts =
           replace_non_ascii2(texts) %>%   # strip non-ASCII characters
           replace_html() %>%              # drop leftover HTML markup
           replace_misspelling() %>%       # auto-correct misspellings (hunspell-based)
           replace_contraction())          # expand contractions ("isn't" -> "is not")
df_texts
(the same rows after cleaning; rows heavily truncated)

texts <chr>
1 Cal Paterson | Home About September 2020 Mozilla is in an absolute state: high overheads, falling usage of Firefox, questionable sources of revenue and now making big cuts to engineering as their income falls. Mozilla recently announced that they would be dismissing 250 people. that is a quarter of their workforce so there are some deep cuts to their work too. …
2 By choosing I agree below, you agree that NPR sites use cookies, similar tracking and storage technologies, and information about the device you use to access our sites … See details. NPR Terms of Use and Privacy Policy.
3 Should you join a big company or start a startup? … Sq lite is an existence proof that you do not need a thousand engineers to have an impact. it is been around since the early 90s, is one of the most shipped pieces of software in existence …
Version 1. 0 came out in 2008 and they are still going strong today, making enough to hire an extra developer and launch a separate product. Sublime is notable for surviving despite massive open - source competition in a field where people hate to pay for tools. As Tristan Hume points out, it is really the only editor that has managed to remain both fast and consistent while still supporting a large plugin ecosystem. These kind of end - to - end qualities are perhaps one of the main advantages for small teams with a long time horizon. Zing has been around only a few years but I am already wildly impressed by the quality of thought and engineering that has gone into it. The Zing Software Foundation is a non - profit, currently funded entirely by donations and working towards being able to employ a second full - time developer. They explicitly chose to be a non - profit out of concern over how profit incentives creep into daily decision making: In You were not Meant to Have a Boss, Paul Graham makes an analogy between animals in the zoo ( employees ) and animals in the wild ( startup founders ). I think hes on to something, but when you start a startup, you still have a boss. In fact, you have the same boss. At the end of the day, it is an inescapable fact that you must do what makes a profit, for the shareholders. The difference in incentives makes a structural difference that permeates every part of an organization. Windows users wake up one day and find ads in their start menu. Will Debian Linux ever try to put ads into any of its software? The concept is absurd. Zing was explicitly designed to be a small, simple language, allowing it to be maintained indefinitely by a small, committed team. Sf is a small organization and makes efficient use of monetary resources. I font see the need for Sf to grow beyond a handful of people. As Emaciate puts it, one of the advantages of being small is that you can speak with a human voice. This comes across clearly from the zing team, who are resolutely human instead of projecting some impassive facade of professionalism. | |
Language & Linguistics, New York, USA You know how to have a polite conversation, right? You listen, wait for a pause, say your bit, then shut up so someone else can speak. In other words, you take your turn. Your obviously not from New York. To an outsider, someone from, say, Toronto or Seattle or London, a conversation among New Corkers may resemble a verbal wrestling match. Everyone seems to talk at once, butting in with questions and comments, being loud, rude and aggressive. Actually, according to the American linguist E J White, there just being nice. When they talk simultaneously, raise the volume and insert commentary ( I knew he was trouble, I hate that! ), New Corkers rent trying to hijack the conversation, White says. There using cooperative overlap, contextualization cues ( like vocal pitch ) and cooperative interruption to keep the talk perking merrily along. To them, argument is engagement, loudness is enthusiasm and interruption means there listening, she writes. Behavior that would stop a conversation dead in Milwaukee nudges it forward in New York. Why do New Corkers talk this way? Perhaps, White says, because its the cultural norm among many of the communities that have come to make up the city: eastern European Jews, Italians, and Puerto Pelicans and other Spanish speakers. As for the famous New York accent, thatch something else again. White, who teaches the history of language at Stony Brook University, New York, argues that Americans sound the way they do because New Corkers sound the way they do. In You Talking to Me? she makes a convincing case that the sounds of standard American English developed, at least in part, as a backlash against immigration and the accent of New York. Although the book is aimed at general readers, its based on up - to - the - minute research in the relatively new field of historical linguistically. ( Here a New Corker would helpfully interrupt, Yeah, which is what? ) Briefly, it is about how and why language changes. Its central premise is that things like social class, gender, age, group identity and notions of prestige, all working in particular historical settings, are what drive change. Take one of the sounds typically associated with New York speech the oi thatch heard when bird is pronounced void, earl oil, certainly hesitantly, and so on. Hairs a surprise. That oi, White says, was a marker of upper - class speech in old New York, a prestige pronunciation used by Teddy Roosevelt and the Four Hundred who rubbed elbows in Mrs Assort ballroom. Hairs another surprise. The pronunciation is now defunct and exists only as a stereotype. It retired from high society after the First World War and by mid - century it was no longer part of New York speech in general. Yet for decades afterwards it persisted in sitcoms, cartoons and the like. Although extinct in the wild ( as linguists like to say ), it lives on in a mythological New York City of the mind. Another feature of New York speech, one that survives today, though its weakening, is the dropping of r after a vowel in words like four ( foal ), park ( pah ) and never ( nevus ). This was also considered a prestige pronunciation in the early 1900s, White says, not just in New York City but in much of New England and the South as well, where it was valued for its resemblance to cultivated British speech. Until sometime in the 1950s, in fact, it was considered part of what elocutionists used to call General American. 
It was taught, the author writes, not only to schoolchildren on the East Coast, but also to aspiring actors, public speakers and social climbers nationwide. But here, too, change lay ahead. While r - dropping is still heard in New York, Boston and pockets along the Eastern Seaboard, it has all but vanished in the South and was never adopted in the rest of the United States. Here the author deftly unravels an intriguing mystery: why the most important city in the nation, its center of cultural and economic power, does not provide, as is the case with other countries, the standard model for its speech. To begin with, White reminds us, the original Americans always pronounced r, as the British did in colonial times. Only in the late 18th century did the British stop pronouncing r after a vowel. Not surprisingly, the colonists who remained in the big East Coast seaports and had regular contact with London adopted the new British pronunciation. But those who settled inland retained the old r and never lost it. ( As White says, this means that Speakerphones accent was probably more like standard American today than Received Pronunciation. ) Posh eastern universities also helped to turn the nations accent westward. Towards the end of the First World War, White says, Ivy League schools fretted that swelling numbers of Jewish students, admitted on merit alone, would discourage enrollment from the Protestant upper class. Admissions practices changed. In the 1920s, elite schools began to recruit students from outside New Yonks orbit and to ask searching questions about race, religion, color and heritage. The result, White says, was that upper - crust institutions shifted their preference for prestige pronunciation toward the purer regions of the West and the Midwest, where Protestants of Nordic descent were more likely to live. Thus notions about what constituted educated American speech gradually shifted. Another influence, the author writes, was the Midwestern - sounding radio and television network English that was inspired by the Second World War reporting of Edward R Murrow and the Murrow Boys he recruited to CBS from the nations interior. Marrows eloquent, authoritative reports, heard by millions, influenced generations of broadcasters, including Walter Cronkite, Set Huntley and Dan Rather, who dint try to sound like they had grown up on the Eastern Seaboard. The voice of the Midwest became the voice of America. This book takes in a lot of territory, all solidly researched and footnoted. But dry? Roundabout. White is particularly entertaining when she discusses underworld slang from the city sensitive lines of business and shes also good on song lyrics, from Tin Pan Alley days to hip - hop. She dwells lovingly on the sharp, smart, swift, and sure lyrics of the golden age of American popular music roughly, the first half of the 20th century. It was a time when New York lyricists, nearly all of them Jewish, preserved in the American Songbook not only the vernacular of the Lower East Side but also the colloquialisms of Harlem and the snappy patois of advertising. You Talking to Me? is engrossing and often funny. In dissecting the exaggerated New York accents of Bugs Bunny and Grouch Marx, White observes that Bugs even wielded his carrot like Grouches cigar. And she says that the word fuck is so ubiquitous in Gotham that it has lost its edge, so a New Corker in need of a blistering insult must look elsewhere. 
There may be some truth to the old joke that in Loci Angeles, people say Have a nice day and mean Fuck off, while in New York, people say Fuck off and mean Have a nice day. Only a selection of our reviews and articles are free. Subscribers receive the monthly magazine and access to all articles on our website. Follow Literary Review on Twitter this book takes in a lot of territory, all solidly researched and footnoted. But dry? Roundabout. patrician T O'Connor on E J White's you Talking To Me? The Unruly History of New York Glisten. HTTP: / / literary review. co. UK / task - of - the - town the identification of a mighty force sparkling intermittently seems to me to constitute the finest and most consistent poetic achievement of Goodies book. ' Candida Milligram on @ landholding's the Story of Scottish Art. HTTP: / / literary review. co. UK / all - that - glisters Though the hotel had a reputation as the areas best, its staff were not used to looking after world leaders, so the arrival of Cubs new strongman, Filed Castor, came as something of a shock. ' @ donnybrook on @ teletypewriter's ten Days in Harlequin. HTTP: / / ow. l / I63850Bvs10 Designed and built by SAM Oakley | |
If you work on web apps, you may sometimes feel that your tools have forsaken you. And your not wrong. Because of the way modern tech stacks are set up, its getting harder and harder to make sure your code is doing what its supposed to. In this blog post, well show you a bug that slips through the cracks today, explain why its nobody fault, and show you how we diff API behaviors to catch bugs ( while generating API specs along the way!! ). You work on a web app. Your co - worker and # BFF, Ski, opened a pull request to remove a property from a commonly used API. You two are tight, so you jump in to review the PR. Skis PR includes a link showing that this property usage has been removed from the website, once again proving why Ski is your # BFF. You +1 the PR enthusiastically, with both of you believing that Ski has taken into account all of the dependencies. The change gets merged. Then: BETRAYAL! There are now outages in the mobile clients, and its late nights for you and Ski. ( Good thing you started out liking Ski so much. ) It turns out that Skis API had been so good that the mobile Pis had started depending on the removed property, unbeknownst to both you and Ski. If this has happened to you, you are not to blame! These bugs are nasty little things: they have to do with dependencies on the code you wrote rather than the code itself, making it hard to catch them through source diffs. In fact, as soon as bugs cross API boundaries, they become extremely hard to catch because they elude common testing and software analysis ( think static and dynamic analysis ) solutions. Developers are left to rely on documentation, word of mouth, and crossing their fingers really hard. To catch these kinds of bugs, you can simply install Akita to analyze API - impacting changes whenever you make a pull request. Here is an example of a comment that Akita leaves on your pull request: Akita enables you to compare API behavior across pull requests, multiple environments ( ex production vs. test ), and even per - existing specs. If Ski had used Akita to compare the API behavior between their production and test environments, they surely would have caught the bug and saved you some late - night debugging! By diffing on behaviors rather than code, Akita summarizes how a pull request changes your API, including: Endpoints added that expose new functionality Removed / changed parameters that may break existing clients Simple type changes ( ex string to int ) that can pollute data pipelines Complex type changes ( ex phone number to date time ) that can break dependencies Coming soon: Impacted Clients, telling you exactly which mobile and web clients are affected Just this week, we released our Git hub integration to deliver insight on every pull request. You can now use Akita without changing any code or con fig files, without having to proxy and as part of your normal developer workflow. Akita does this by showing you semantic diffs between implicit API contracts. In other words, this means Akita does some magic to figure out what your API normally does and then diffs on that, instead of diffing syntactically on the source code. Diffing on observed behaviors, rather than the code itself or a static graph of what Pis could call each other, allows Fajitas reports to pinpoint potential issues much more precisely. Akita is flexible enough to run in both S, as shown here, and in production environments! 
You may be wondering how we do this, since no existing liner, static analysis, or dynamic analysis gets anywhere close. Under the hood, Akita analyzes API traffic to build a model of your API from scratch. As a foundation, Akita builds an API spec for your API. Here we show an Opening spec that Akita automatically generated: The next step is where the magic comes from. Akita uses advanced programming languages technology to detect not just the basic spec properties, but also implicit API contracts, for instance, specific types like date time, email, phone number, and more: All you have to do is set up Akita to watch your API traffic. No code changes and no propping necessary: Wee just released the spec viewer and Git hub integration in our private beta and would love to have you try out our spec generation and / or semantic API behavior diffs. Hairs how you can help: Sign up for our private beta if your interested in trying things out! You may also be interested in this talk and demo we gave at the API Specs Conference last week. Were also constantly trying to make our tool better! If change analysis is an issue for you, please fill out this survey with an opportunity to win a $50 Amazon gift card. Spread the word about us! Wed love all the feedback we can get! Copyright 2020Akita Software, Inc. All rights reserved. | |
Share this with Email Facebook Messenger Messenger Twitter Interest Whats app Linked in Copy this link These are external links and will open in a new window She was an Indonesian domestic helper who earned S$600 ( 345 ) a month working for an extremely wealthy Singaporean family. He was her employer, a titan of Singapore's business establishment and the chairman of some of the country's biggest companies. One day, his family accused her of stealing from them. They reported her to the police - triggering what would become a high - profile court case that would grip the country with its accusations of pilfered luxury handbags, a DVD player, and even claims of cross - dressing. Earlier this month, Parch Aliyah was finally acquitted. " I am so glad I am finally free, " she told reporters through an interpreter. " I have been fighting for four years. " But her case has prompted questions about inequality and access to justice in Singapore, with many asking how she could have been found guilty in the first place. Ms Parch first began working in Rm Lie Mun Long's home in 2007, where several family members including his son Lark lived. In March 2016, Rm Lark Lie and his family moved out of the home and lived elsewhere. Court documents that detail the sequence of events say that Ms Parch was asked to clean his new house and office on " multiple occasions " - which breaks local labor regulations, and which she had previously complained about. A few months later, the Lie family told Ms Parch she was fired, on the suspicion that she was stealing from them. But when Rm Lark Lie told Parch that her employment was terminated, she reportedly told him: " I know why. You are angry because I refused to clean up your toilet. " She was given two hours to pack her belongings into several boxes which the family would ship to Indonesia. She flew back home on the same day. While packing, she threatened to complain to the Singapore authorities about being asked to clean Rm Lark Lie's house. The Lie family decided to check the boxes after Ms Part's departure, and claimed they found items inside that belonged to them. Rm Lie Mun Long and his son filed a police report on 30 October. Ms Parch said had no idea about this - until five weeks later when she flew to Singapore to seek new employment, and was arrested upon arrival. Unable to work as she was the subject of criminal proceedings, she stayed in a migrant worker's shelter and relied on them for financial assistance as the case dragged on. Ms Parch was accused of stealing various items from the Lies including 115 pieces of clothing, luxury handbags, a DVD player and a Gerald Gents watch. Altogether the items were said to be worth S$34, 000. During the trial, she argued that these alleged stolen items were either her belongings, discarded objects that she found, or things that she had not packed into the boxes themselves. In 2019, a district judge found her guilty and sentenced her to two years and two month's jail. Ms Parch decided to appeal against the ruling. The case dragged on further until earlier this month when Singapore's High Court finally acquitted her. Justice Chan Gens Non concluded the family had an " improper motive " in filing charges against her, but also flagged up several issues with how the police, the prosecutors and even the district judge had handled the case. 
He said there was reason to believe the Lie family had filed their police report against her to stop her from lodging a complaint about being illegally sent to clean Rm Lark Lie's house. The judge noted that many items that were allegedly stolen by Ms Parch were in fact already damaged - such as the watch which had a missing button - knob, and two phones that were not working - and said it was " unusual " to steal items that were mostly broken. In one instance, Ms Parch was accused of stealing a DVD player, which she said had been thrown away by the family because it did not work. Prosecutors later admitted they knew the machine could not play Dds, but did not disclose this during the trial when it was produced as evidence and shown to have worked in another way. This earned criticism from Justice Chan that they used a " sleight - of - hand technique [ that ] was particularly prejudicial to the accused ". In addition, Justice Chan also questioned the credibility of Rm Lark Lie as a witness. The younger Rm Lie accused Ms Parch of stealing a pink knife which he allegedly bought in the UK and brought back to Singapore in 2002. But he later admitted the knife had a modern design that could not have been produced in Britain before 2002. He also claimed that various items of clothing, including women's clothes, found in Ms Part's possession were actually his - but later could not remember if he owned some of them. When asked during the trial why he owned women's clothing, he said he liked to cross - dress - a claim that Justice Chan found " highly unbelievable ". Justice Chan also questioned the actions taken by police - who did not visit or view the scene of the offenses until about five weeks after the initial police report was made. The police also failed to offer her an interpreter who spoke Indonesian, and instead offered one who spoke Malay, a different language which Ms Parch was not used to speaking. " It was very worrying conduct by the police in the way they handled the investigations, " Eugene Tan, Professor of Law at Singapore Management University told BBC News. " The district judge appeared to have prejudged the case and failed to pick out where the police and prosecutors fell short. " The case has touched a nerve in Singapore where much of the outrage has centered on Rm Lie and his family. Many have perceived the case as an example of the rich and elite bullying the poor and powerless, and living by their own set of rules. Although justice ultimately prevailed, among some Singaporeans it has rattled a long - held belief in the fairness and impartiality of the system. " There has not been a case like this in recent memory, " said Prof Tan. " The apparent systemic failures in this case have caused a public disquiet. The question that went through many people's minds were: What if I was in her shoes? Will it be fairly investigated and judged impartially? That the Lies were able to have the police and the lower court fall for the false allegations have raised legitimate questions about whether the checks and balances were adequate. " Following the public outcry, Rm Lie Mun Long announced he was retiring from his position as chairman of several prestigious companies. In a statement, he said he " respected " the decision of the High Court and had faith in Singapore's legal system. But he also defended his decision to make a police report, saying: " I genuinely believed that if there were suspicions of wrongdoing, it is our civic duty to report the matter to the police ". 
Rm Lark Lie has remained silent and has not released any statement on the matter. The case has triggered a review of police and prosecutor processes. Law and Home Affairs Minister K Shamanism admitted " something has gone wrong in the chain of events ". What the government does next will be watched very closely. If it fails to address Incorporeal demands for " greater accountability and systemic fairness ", this may lead to " a gnawing perception that the elite puts its interests above that of society's, " wrote Singapore commentator Donald Low in a recent essay. " The heart of the debate [ is ] whether elitism has seeped into the system and exposed a decay in our moral system, " former journalist On Ball said in a separate commentary. " If this is not addressed to satisfaction, then the work of the helper, lawyer, activists and judge will be wasted. " The case has also highlighted the issue of migrant worker's access to justice. Ms Parch was able to stay in Singapore and fight her case due to the support of the non - governmental organization Home, and lawyer Nail Bacchanalian, who acted pro boon but estimated his legal fees would have otherwise come up to S$150, 000. Singapore does provide legal resources to migrant workers, but as they are usually their families sole breadwinners, many of those who face legal action often decide not to fight their case, as they do not have the luxury of going for months if not years without income, according to Home. " Parch was represented steadfastly by her lawyer who fought doggedly against the might of the state. The legal resource asymmetry was just so stark, " said Prof Tan. " It was a David versus Goliath battle - with the Davids emerging triumphant. " As for Ms Parch, she has said that she will now be returning home. " Now that my problems are gone, I want to return to Indonesia, " she said in media interviews. " I forgive my employer. I just wish to tell them not to do the same thing to other workers. " The six - month scheme sees the government pay a share of wages lost due to the corona virus pandemic. Have you been getting these songs wrong? What happens to your body in extreme heat? | |
Twitter is a powerful medium for sketching: a tool for fluidly developing ideas in mealtime. What can we learn from its design? I think it shows us the power of 1 ) the right constraints, 2 ) low barriers to starting and finishing, and 3 ) a social context. Writing shard. Like most aspiring bloggers, my folders of drafts and my dreams of future prolifically outweigh my actual output. Vie found a curious trick for getting over this hurdle, though: writing tweet threads. Vie published many little bursts of tweets about topics Mi curious about: These are exactly the kinds of things Id like to blog about! But somehow, Vie found it 10 times easier to publish the tweet threads. I can hear you groaning already. Of course tweeting is easier than writing, you dummy! Our minds are being driven into the meat grinder 280 characters at a time, as we replace deep logical thought with aphorisms and memes. Twitter is Power point thinking on steroids. But I think this dismissive response misses the point. We cant really understand Twitter by treating it as a mediocre replacement for essays and research papers. We need to see it as a new medium on its own terms. In particular, Twitter is a medium for sketching for playing with ideas, on the fly. Twitter is more similar to scribbling on a whiteboard or tossing ideas around at the cafe than writing a book. ( By sketching I font mean literally just drawing; I mean any lightweight early expression of a thought. ) Why does Twitter work so well for this? Here are some of my theories: By reflecting on these properties, I think we can gain some insight not just into Tweeter specifically, but also the broader landscape of tools for thinking. Lets dive in. Thinking about big new things is hard, and our brains are good at finding ways to weasel out of the job by finding something easier to do, but still plausibly productive. Unfortunately, in the early stages of sketching out an idea, such distractions abound: worrying about word choice in the last paragraph instead of writing the next one, futzing with the font size, making a new blog system instead of writing the damn blog post. We can try to avoid these temptations, but an easier route is to simply find tools that font allow have the temptations in the first place. This is a key property of good sketching tools: they provide the right constraints. Lets examine a few of Twitters valuable constraints. Start with the obvious one: the 280 character limit. Twitters main constraint is encouraging concision. Its hard to dwell on word choice when you have so little space to work with. Twitters conversational tone also helps here can just write like I talk, and any fancy words would seem out of place. And of course, I cant tweak fonts and margins, which cuts off a distraction vector. But threads complicate the story of character limits a bit. The limit inst really that your entire point must fit into one twee tits that each of your individual points must squeeze under the limit. This provides a different useful constraint: each idea has to be wrapped in a little atomic package. I find this helpful for figuring out the boundaries between my thoughts and clarifying the discrete units of an argument. That constraint sort of resembles the benefits of an outlining tool. But Twitter has another constraint: a thread is linear! No indenting allowed. This forces a brisk straight line through the argument, instead of getting mired in the fine points of the sub - sub - sub - arguments of the first idea. 
Very limiting, but simultaneously freeing. Taken together, these constraints frame the pros and cons of the medium, its appropriate range of usage. Obviously, writing a book in a single - level outline would be foolish, but it works for a rough sketch. More interestingly, I think Twitter is useless for persuading a skeptical reader; theirs simply not space for providing enough detail and context. This is a common property of media for sketching: the initial lockup inst impressive enough to sway a user, even if its a useful tool for the internal team. I prefer to use Twitter as a way to workshop ideas with sympathetic parties who already have enough context to share my excitement about the ideas. Perhaps theirs a general principle here: Twitter is good for sketching ideas for the same reasons its bad for fully developing them. You cant accidentally start writing a book in Twitter, and thatch kind of the point. In general, what are the right constraints for a sketching tool? I think this question is deeper than it seems at first glance. You might say something like only offer the minimum fidelity needed to convey the point, but I think its not obvious how to define that minimum level. Sketching with a fat marker can prevent us from getting too detailed with our drawings; Muse, which Vie been using for iPad sketching recently, intentionally limits your ink choices to just a few colors. This works great for certain kinds of thinking and lockups. But for designing new interactions with animation and physics, we need a totally different class of tools with more capabilities! The line between essential and spurious depends on the goal. Providing the right constraints inst always a matter of removing. It can require adding advanced capabilities too, like this typeface design tool that uses fancy machine learning to provide a few simple knobs for controlling things like bold. It does not let you move individual vector points, but instead lets you operate at a more natural level of abstraction. If your not careful, constraints can easily damage fluidity's Enlargement showed, tying a brick to a pencil does not yield a productive tool. Overall, it seems that we want constraints that help keep us on track with fluid thought, but font rule out too many interesting possibilities. Considering both of these criteria together is a subtle balancing act, and I font see easy answers. Theirs a low barrier to starting on Twitter. Just click a button, type a thought, no need to spend a minute remembering how to start my blog server. Often, that first minute of friction is enough to prevent me from getting into the flow of writing. But the more interesting phenomenon is the low barrier to finishing. On Twitter, a single sentence is a completely acceptable unit of publication. Anything beyond that is sort of a bonus. In contrast, most of my blog posts go unpublished because I fear there not complete, or not good enough in some dimension. These unpublished drafts are obviously far more complete than a single tweet, but because there on a blog, they font feel done, and its hard to overcome the fear of sharing. This seems like a crucial part of sketching tools: when you make a sketch, it should be understood that your idea is immature, and feel safe to share it in that state. Theirs a time and a place for polished, deeply thorough artifacts and its not Twitter! Everyone knows you just did a quick sketch. I believe that quantity leads to quality. 
The students who make more pots in ceramics class improve faster than the students who obsess over making a single perfect pot. A tool with a built - in low barrier to finishing makes it easier to overcome the fear, do more work, and share it at an earlier stage. In my experience, sketching always requires a delicate dance between individual thought and collaboration. You sketch to clarify something for yourself, but also to communicate with others. I think a good sketching medium should account for both halves of the process. Writing a blog can feel like a lonely one - way mirror: release something into the world, maybe get a few comments back and some Hacker News snark. In contrast, Twitter is a bazaar, buzzing with activity. The engagement ratio is totally different. You can easily have micro - conversations around individual points in a thread. When the same people start showing up time after time, it starts to feel like seeing acquaintances in a village. On Twitter, I write for my Twitter friends, not for some amorphous crowd. At its best, this engagement leads to the kind of back - and - forth that characterizes my favorite kinds of sketching sessions. Ideas are in the air, its not clear where they came from really, they combine to form new ones in mealtime. For me, Twitter does an oddly good job at simulating the thrilling creative energy of a white boarding session. People pop in and out of the conversation offering insights; trees and sub - trees form riffing off of earlier points. Of course, my feeling of safety here presumes healthy engagement from other parties, a privilege not enjoyed by all. I suppose its kind of odd that such a globally public medium is suitable at all for sketching it seems only possible because Vie found safe, trusting mini - communities, defined by informal and permeable boundaries. Perhaps a more private Twitter would be even better for sketching, although it might cut out new people from entering the conversation? In an attempt to sketch more in the blog medium, I just whipped up this blog post in a couple hours, so I font really have a grand conclusion. And yet Mi still hitting publish! Mi curious to think more about the constraints / freedoms afforded by different kinds of creative tools, and whether we could get more clever with those constraints to enable new kinds of sketching. Mi especially curious about kinds of sketching which are only possible thanks to computers, and could not have been done with paper and pen. Paper and pen are great tools, but what else is out there? I guess Ill try to keep sketching more freely on the blog about ideas like this. Click one of the subscribe links below if yous like to see my future writing on this sort of thing. And if you have thoughts to share, send me an email or a tweet! If you enjoyed this post, here are some people who have written / spoken about related topics in much greater depth and eloquence than I have here: | |
Bangor Daily News Amine news, sports, politics, election results, and obituaries Augusta, Amine Ranked - choice voting will be used in the presidential election in Amine this fall after the states high court ruled that Republicans did not gather enough valid signatures to put forward a peoples veto effort challenging the voting method. The decision, coming after months of legal challenges and an initial ruling from the Amine Supreme Judicial Court earlier this month that allowed the state to begin printing ballots, is a victory for Secretary of State Matt Dunlap, a Democrat whose office twice determined that opponents of ranked - choice voting failed to get enough signatures. It marks perhaps the last twist in a tumultuous effort and brings a historic election this fall, when Amine will be the first state to use ranked - choice voting in a presidential election. Proponents of the referendum faced a shorter signature - gathering season due a shortened legislative session due to the corona virus pandemic. The long legal battle over the validity of hundreds of signatures prevented the effort from switching into campaign mode. If Republicans had gathered enough legal signatures, it would have been the third time ranked - choice voting went to voters. Amine GOP Chair Dime Bouzoukis said the group was disappointed in the decision and exploring potential review options through the federal courts. The case hinged on 988 signatures collected by two people who were not registered to vote in the towns they lived in prior to circulating petitions. Dunlap argued that should invalidate the signatures they helped gather, while Republicans charged that the secretary of state was wrongly disenfranchising voters. A lower - court judge sided with Republicans in August. A panel of five judges on the high court disagreed in a 24 - page ruling issued Tuesday, saying the requirement that circulatory be registered voters in their towns while circulating sets only reasonable limits on petitioners. Further, they said being registered to vote is an easy way for the state to determine the validity of the circulatory petitions and vital to quick review. We conclude that the governments interest is sufficient to justify the restriction that the requirement places on petitioners First Amendment rights, the judges said. The decision comes nearly two weeks before Amine is set to send out absentee ballots to many voters. Overseas and military voters were required to get their ballots by last week. If the court upheld the lower courts ruling, it would have delayed the laws implementation until the June 2021 election because of the tight time frame for a ruling. Dunlap said the ruling comes as a great relief because it will help the state avoid the complications, confusion and expense that would have come from reprinting and reissuing ballots. It is a major blow to state and national Republicans, who sank over a half - million dollars into the referendum effort. Republicans have been opposed to ranked - choice voting since it first passed in Amine in 2016 and have led other legal challenges against it up until this year. All have failed. It is hard to say whether the decision will affect the presidential race in Amine. Former Vice President Joe Biden, the Democratic nominee. has held a large lead over President Donald Trump in most statewide polls thus far and has been tied in the swing 2nd Congressional District. 
Also on the ballot will be three third party candidates Bowie Hawkins of the Green Party, Jew Nonsense of the Libertarian Party and Rocky DE la Entente of the Alliance Party but they have failed to gain much traction in state or national polls so far. Correction: A previous version of this report misstated the requirements for being a circulatory under the Amine Constitution. | |
To revisit this article, visit My Profile, then View saved stories. To revisit this article, visit My Profile, then View saved stories. Paul Ford To revisit this article, visit My Profile, then View saved stories. To revisit this article, visit My Profile, then View saved stories. When I go to the doctor, they ask what I do, and when I tell them, they start complaining to me about the software at the hospital. I love this, because I hate going to the doctor, and it gives us something to talk about besides my blood pressure. This is a pattern in my life: When I am asking at the library reference desk, chatting with the construction contractor with her iPad, or applying for a loan at the bank, I just peer over their shoulder a bit while they are answering a question not so much to be intrusive and give a low little whistle at the mess on their screens. And out pours a litany of wasted hours and bug reports. Now I have made a friend. Paul Ford Clive Thompson Rhetoric Villain Good software makes work easier, but bad software brings us together into a family. I love bad software, which is most of it. Friends text me screenshots of terrible procurement systems, knowing that I will immediately text back, Banana cakes. I will even watch videos of bad software. There are tons on You tube, where people demo enterprise resource - planning systems and the like. These videos fill me with a sort of yearning, like when you step inside some old frigate they have turned into a museum. Best I can tell, the bad software sweepstakes has been won ( or lost ) by climate change folks. One night I decided to go see what climate models actually are. Turns out they are often massive batch jobs that run on supercomputers and spit out numbers. No buttons to click, no spiny globes or toggle switches. they are artifacts from the deep, mainframe world of computing. When you hear about a climate model predicting awful Earth stuff, they are talking about hundreds of FORTRAN files, with comments at the top like The subroutines in this file determine the potential temperature at which seawater freezes. they are not meant to be run by any random nerd on a home computer. This does not mean they are inaccurate. they are very accurate. As code goes, the models are amazing, because they are attempts to understand the entire, actual Earth via programming. All the ocean currents, all the ice and rain, all the soil and light. And if you feel smart, reading a few pages of climate model code will fix you up tout suite. If you, too, would like to know exactly how little you know about the machinery of the natural world, go on Git hub and look through the Modular Ocean Model 6, released by the National Oceanic and Atmospheric Administration, which is part of the Department of Commerce. Only America would make the weather report to money. The software people get amazing tools that let them build amazing apps, and the climate people get lots of FORTRAN. This is one of the weirdest puzzles of this industry. Every industry or discipline has its signature software. Climate has big batch climate models. Sales has the Cm, hence Salesforce. Doctors have those awful health care records systems; social scientists use Spas or Ass or R; financial types plug everything into Excel. There are big platforms that help people do all kinds of work. But you know what blows them away? Software for making software. The software industry's software is so, so good ( not that people do not complain ). 
Just take a look at the modern IDE ( integrated development environment ), the programs programmers use to program more programs. The biggest are made by tech giants: Code ( Apple ) and Visual Studio ( Microsoft ) and Android Studio ( Google ), for example. I love to mock software, and yeah, these programs are huge and sprawling, but when I open these tools I feel like a medieval stonemason dragged into midtown Manhattan and left to stare at the skyscrapers. My mouth hangs open and my chisel falls from my sandstone - roughened hands. In an IDE you drag buttons around to make the scaffolding for your apps. You type a few letters and the software guides your hand and finishes your thoughts, showing you functions inside of functions and letting you pick the right one for the task. Ultimately you click a little triangle ( like Play on a music player ) and it builds the app. I never get over it. And they give it away for free, so that people use it to make more software, which is why all the real estate in New York City is worth around a trillion and a half bucks, and Apple, which takes its famous 30 percent cut in the App Store, is worth $2 trillion. Of course, that is a down payment when you consider what we are going to pay to mitigate climate change. So the software people get amazing tools that let them build amazing apps, and the climate people get lots of FORTRAN. This is one of the weirdest puzzles of this industry. We have these tools for making new, wonderful tools, and yet the people who need help the most are using these old tools and methods. A lot of it is due to a very ancient and serious split between academic programming, which is frequently optimized for doing something novel and publishing a paper about it, and the tech industry, which is, to put it simply, optimized for making lots of money by giving people things they use all the time. That whole Xerox PARC thing in the aesthete thing that supposedly gave us the Mac, etc. was actually not about having a mouse and windows; the big core idea was that we would build models of our world in software and adapt them as we explored. Doctors could simulate new treatments; children could simulate rocket ships. we would all have highly visual pocket climate models we could explore and manipulate, or the doctors would all be programmers themselves and make better patient - management systems. The idea was for software to become the humble servant of every other discipline; no one anticipated that the tech industry would become a global god - king among the industries, expecting every other field to transform itself in tech's image. there is a thing in programming: Code has a way of begetting more code. You start hacking on some problem, and six months later you are still hacking at it, adding features. You write code that helps you write more code. But what we do not do so much, what our tools do not help us do, is continually ask, who is this for, why are we doing it, and how will people build upon it? Decisions were made for us, decades ago, and here we are. Best not to dwell on what might have been. let us look around and learn. What I am learning as I read that climate code in long pandemic evenings is that the rules of the world are to be discovered and accepted, not changed. it is a hard lesson to learn, when I work in a field with such wonderful, fluid, flexible tools. It feels as if we should be able to hack our way out of this. 
The next phase of growth for our industry, finally, should be to learn about the world before we try to change it. This article appears in the October issue. Subscribe now. WIRED is where tomorrow is realized. It is the essential source of information and ideas that make sense of a world in constant transformation. The WIRED conversation illuminates how technology is changing every aspect of our lives from culture to business, science to design. The breakthroughs and innovations that we uncover lead to new ways of thinking, new connections, and new industries. More From WIRED Contact 2020 Cons Ants. All rights reserved. Use of this site constitutes acceptance of our User Agreement ( updated 1 / 1 / 20 ) and Privacy Policy and Cookie Statement ( updated 1 / 1 / 20 ) and Your California Privacy Rights. Wired may earn a portion of sales from products that are purchased through our site as part of our Affiliate Partnerships with retailers. The material on this site may not be reproduced, distributed, transmitted, cached or otherwise used, except with the prior written permission of Cons Ants. Ad Choices | |
Hi everyone, After overthinking overthinking last week I thought this week Id talk about something that seems like ( but inst ) its opposite: half - arsing ( half - assign if you speak US English ). Half - arsing is when you put in less effort than would be needed to get a good result, and the result you get out of it is correspondingly worse. Its the kludge, the cutting of corners, anything where where the reason you dint do better is not that you could not, just that you could not be bothered. Like overthinking, half - arsing is good, actually, and gets an unfairly bad rep, and you should probably be doing more of it than you are. Why should you half - ares things? Generally because its harder to do it properly than the task warrants. Most things that need doing font need doing very well. They especially often font need doing well if the alternative is not doing them at all. Often we have only so much time and effort we can spend on a thing, and doing the thing is better than not doing the thing. In general, you should aim to put effort into any given thing proportionate to outcome. Something that is hidden away can be done with an ugly kludge. Something that does not meaningfully contribute to the success of the project can be skimped on. This comes up particularly when there are many things to do and a finite amount of effort to spend. In these case a refusal to half - ares is not actually a refusal to half - ares, its abdicating responsibility over which bits to half - ares. Parsons law of triviality, illustrated with the example that a committee for a nuclear power plant spends a disproportionate amount of time deciding on the color of its bike shed, is essentially an example of this. You should probably just half - ares that decision, its basically fine. Sometimes its even impossible to solve the problem within the means available to you. You might be prohibited, or having to work in the constraints set by other people ( cf. Distributing Blame ), or you might to pick a completely hypothetical example have left writing your newsletter too late and want to head over to the barbecue on your partners island to meet your new neighbors ( no more than six of us though, I promise, UK government ). In many of these cases, its still worth doing something that improves the situation even if its not the ideal scenario. So thatch why half - arsing is good, but why do people not do enough of it? Well, personally I blame school ( I try not to explain too many things this way, but I think this one really makes sense ). We spend too much of our formative years in an environment where there is an external arbiter who will punish us for not doing the thing properly, and does not allow us to use our judgment as to what counts as good enough. Too often this causes us to acquire an inappropriate degree of perfectionism, where our only options for half - archery are to either convince ourselves that the thing is unimportant, or we procrastinate until we have no choice but to half - ares in order to get things done in time. We may also end up solving this problem by never doing the thing at all because we never make time to do it properly because we know full well its not worth doing properly. I find people do this with reading a lot ( I especially blame school for this one ). I read a book yesterday, start to finish, in about an hour. 
By which I mean I read the start, I read the finish, and I read some of the pages in between but probably fewer than 10 % of the total content, by skipping over all of the bits that were boring or annoying. Why? Well because it wast a very good book, and it wast worth more than that, but it had enough interesting content in it and I valued the sense of the book enough that it was definitely worth an hour, but it probably wast worth two, and I could not have read it properly in less than about four. If Id insisted on a proper reading of it, Id never have read the book at all, and Id have been the worse for it. Most people are not willing to read this way, because they have an invisible teacher sitting on their shoulder judging them if they read improperly ( cf. Jimmy Cricket Must Die ). As well as preventing them from reading improperly this also prevents them from reading properly, because if you cannot half - ares a book then starting reading it dooms you to either probably spending too much time on it for what its worth, or to feeling bad about your failure to read it. Ultimately, mastering the skill of half - archery will probably net improve the quality of your work in this way, because it allows you to priorities effectively, and to judge what level of quality a project actually deserved. By fixing our inappropriate levels of perfectionism, we can invest the appropriate amount of effort into the problems that actually matter, and not get caught up on the pointless bits that font. We sent an email to with a link to finish logging in. I did this throughout college. " C's get degrees " lo The phrase " Good enough for government work " comes in handy here. |
NA
We will also tokenize the texts into words and:
- remove all digits and digit-letter combinations using regex,
- remove stopwords. We will then revert the tokenization (paste the words back into texts) so that we can re-tokenize with a different method (skip-grams) later on.
pattern = c(
  #digits only
  "\\b[[:digit:]]+\\b",
  #00 mm, 00 cm (number and unit)
  "[\\d]+[\\s]*(mm|cm)",
  #abc123, abc123def (letters numbers)
  "[A-Za-z]+[\\d]+[\\w]*",
  #123abc, 123abc32 (numbers letters)
  "[\\d]+[A-Za-z]+[\\w]*",
  #diameter
  "\u00F8[\\s]*[\\d]+[\\w]*",
  #all numbers
  "\\b\\d+\\b",
  #single letter words
  "\\b(\\w)\\b"
  # "[\\d]+[\\w]*" #NOT RUN: 123(word starting with numbers)
)
pattern = paste0(pattern, collapse = "|")
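A quick sanity check of the combined pattern on a few made-up tokens (the test vector is hypothetical; the last token should be the only survivor):
test_tokens <- c("12", "25mm", "abc123", "123abc", "ø20", "a", "keep")
str_remove_all(test_tokens, pattern)
# expecting: "" "" "" "" "" "" "keep"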
df_texts <-
  df_texts %>%
  unnest_tokens(output = word,
                input = texts,
                to_lower = TRUE) %>%
  # keep each word only once per document
  distinct(ID, word, .keep_all = TRUE) %>%
  anti_join(get_stopwords()) %>%
  # anti_join(stopwords_extra) %>%
  mutate(word = str_remove_all(word, pattern)) %>%
  # drop tokens emptied entirely by the regex removal
  filter(word != "") %>%
  group_by(ID) %>%
  summarise(texts = paste0(word, collapse = " ")) %>%
  ungroup()
Joining, by = "word"
`summarise()` ungrouping output (override with `.groups` argument)
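Before re-tokenizing it is worth eyeballing one cleaned document (str_trunc just shortens the print-out):
df_texts %>%
  slice(1) %>%
  pull(texts) %>%
  str_trunc(200)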
Tokenization to skip-grams and tf-idf
df_text_tokens <-
  df_texts %>%
  unnest_skip_ngrams(
    output = word,
    input = texts,
    n = 2,      # up to two-word expressions
    n_min = 1,  # keep single words as well
    k = 1,      # allow at most one skipped word in between
    to_lower = TRUE
  ) %>%
  count(ID, word, sort = TRUE) %>%
  bind_tf_idf(document = ID,
              term = word,
              n = n)
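To see what these settings produce, here is tokenizers::tokenize_skip_ngrams (which tidytext wraps) on a toy sentence; with n = 2, n_min = 1 and k = 1 we get single words plus word pairs spanning at most one skipped word:
tokenizers::tokenize_skip_ngrams("the quick brown fox",
                                 n = 2, n_min = 1, k = 1)
# unigrams like "quick" plus pairs like "the quick", "the brown", "quick fox"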
Let’s have a first look at the most frequent words; maybe some need eliminating.
df_text_tokens %>%
group_by(word) %>%
summarise(n = sum(n)) %>%
arrange(desc(n)) %>%
slice(1:30) %>%
ggplot(aes(n, fct_reorder(word, n) )) +
geom_col(fill = "lightblue", alpha = .7) +
labs(
x = "",
y = "",
subtitle = "Top words")
`summarise()` ungrouping output (override with `.groups` argument)
Thoughts? Lots of very general words.
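If some of these generic words hurt the clustering, this is where the user-defined dictionary comes in: define stopwords_extra (the word list below is just a hypothetical example), uncomment the anti_join(stopwords_extra) line in the cleaning chunk above, and re-run it.
stopwords_extra <- tibble(word = c("can", "use", "also", "one", "like"))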
Topic modelling
First we cast the tf-idf weights into a sparse document-term matrix:
sparse_matrix <-
df_text_tokens %>%
cast_sparse(
row = ID,
column = word,
value = tf_idf,
giveCsparse = TRUE)
library(text2vec)
lsa = LSA$new(n_topics = 50)
dtm_tfidf_lsa <- fit_transform(sparse_matrix, lsa)
INFO [16:16:46.223] soft_als: iter 001, frobenious norm change 1.969 loss NA
INFO [16:17:01.184] soft_als: iter 002, frobenious norm change 0.330 loss NA
INFO [16:17:15.277] soft_als: iter 003, frobenious norm change 0.026 loss NA
INFO [16:17:29.862] soft_als: iter 004, frobenious norm change 0.006 loss NA
INFO [16:17:45.440] soft_als: iter 005, frobenious norm change 0.002 loss NA
INFO [16:17:59.703] soft_als: iter 006, frobenious norm change 0.001 loss NA
INFO [16:18:13.930] soft_als: iter 007, frobenious norm change 0.001 loss NA
INFO [16:18:13.938] soft_impute: converged with tol 0.001000 after 7 iter
head(dtm_tfidf_lsa)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
1 -2.391267e-06 -1.409928e-04 -1.201064e-06 1.762378e-04 -1.348982e-06 7.158809e-05 3.437796e-04 -0.0006653092 8.837780e-05
4 -1.897320e-06 -9.019553e-05 -7.235713e-07 2.017001e-04 -8.256820e-07 -1.002800e-06 8.887208e-06 -0.0002650174 4.355586e-05
7 -8.956435e-07 -9.644083e-05 -8.946838e-07 2.117094e-04 -1.058537e-06 8.604888e-05 1.741675e-04 -0.0001732908 2.263424e-05
10 -3.995314e-06 -1.448859e-05 -1.183575e-06 3.167581e-04 -5.643982e-04 9.052750e-05 1.628678e-05 -0.0015012930 1.585240e-04
14 -5.286939e-06 -2.565596e-04 -1.076148e-06 1.549264e-06 -1.127226e-06 -1.043580e-05 6.538717e-05 -0.0004283815 2.922944e-04
16 -2.802251e-06 -1.010300e-05 -9.597516e-07 1.391868e-06 -9.140042e-07 5.375672e-06 6.893239e-05 -0.0002612126 1.330588e-04
(columns [,10] through [,50] omitted for brevity)
Calculate distances
dist_all <-
text2vec::dist2(dtm_tfidf_lsa,
method = "cosine",
norm = "l2")
How many clusters are optimal?
After: https://rpkgs.datanovia.com/factoextra/reference/fviz_nbclust.html
Elbow method (total within-cluster sum of squares):
library(factoextra)
fviz_nbclust(dist_all,cluster::pam, method = "wss") +
geom_vline(xintercept = 3, linetype = 2)
Average silhouette width (note we cluster with pam, not kmeans):
fviz_nbclust(dist_all,cluster::pam, method = "silhouette")
library(cluster)
gap_stat <- clusGap(dist_all,cluster::pam,
K.max = 10, B = 10)
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 10) [one "." per sample]:
.......... 10
print(gap_stat, method = "firstmax")
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = dist_all, FUNcluster = cluster::pam, K.max = 10, B = 10)
B=10 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstmax'): 10
logW E.logW gap SE.sim
[1,] 7.461298 7.980678 0.5193797 0.006902559
[2,] 7.085486 7.738289 0.6528028 0.005273111
[3,] 6.946275 7.675577 0.7293014 0.007459287
[4,] 6.812775 7.651558 0.8387826 0.004338764
[5,] 6.771422 7.635575 0.8641532 0.005440495
[6,] 6.726251 7.620591 0.8943399 0.005710305
[7,] 6.689221 7.608336 0.9191145 0.006088979
[8,] 6.649406 7.598656 0.9492501 0.006153996
[9,] 6.602872 7.588809 0.9859370 0.005952033
[10,] 6.579018 7.580914 1.0018968 0.006203080
fviz_gap_stat(gap_stat)
Clustering with pam
The diagnostics above do not agree on a single k, so we settle on k = 6 as a judgment call.
mod_pam <- cluster::pam(dist_all,k = 6 )
df_clusters <-
  tibble(
    ID = as.integer(names(mod_pam$clustering)),
    cluster = mod_pam$clustering
  ) %>%
  arrange(ID)
df_text_tokens <-
df_text_tokens %>%
inner_join(df_clusters, by="ID")
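A quick sanity check on cluster sizes:
df_clusters %>%
  count(cluster)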
Compare vocabulary between clusters?
library(tidylo)
df_text_tokens_clusters <- df_text_tokens %>%
  # aggregate counts per cluster and word (not per document)
  group_by(cluster, word) %>%
  summarise(n = sum(n)) %>%
  ungroup() %>%
  bind_log_odds(cluster, word, n)
`summarise()` regrouping output by 'cluster' (override with `.groups` argument)
df_text_tokens_clusters %>%
  arrange(-log_odds_weighted) %>%
  mutate(cluster = as.factor(cluster)) %>%
  group_by(cluster) %>%
  slice_head(n = 20) %>%
  ungroup() %>%
  ggplot(
    aes(
      log_odds_weighted,
      reorder_within(word, log_odds_weighted, cluster),
      fill = cluster)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~cluster, scales = "free") +
  scale_y_reordered() +
  labs(x = NULL, y = NULL)
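To put labels on the clusters it helps to look at the articles themselves. A minimal sketch, assuming the row order of articles still matches the loop index we used as ID:
articles %>%
  mutate(ID = row_number()) %>%
  inner_join(df_clusters, by = "ID") %>%
  group_by(cluster) %>%
  slice_head(n = 3) %>%
  ungroup() %>%
  select(cluster, url)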
Literature:
NLP
- https://text2vec.org (Dmitriy Selivanov)
- http://tidytext.org (Julia Silge, David Robinson)
Tidylo
- https://github.com/juliasilge/tidylo (Julia Silge)
Clustering
- https://rpkgs.datanovia.com/factoextra/reference/fviz_nbclust.html (factoextra)