
posted by janrinok on Thursday September 19 2019, @03:09PM
from the perl-one-liners dept.

Back in May, writer Jun Wu explained in her blog how Perl excels at text manipulation. She often uses it to tidy data sets, a necessity since data is frequently collected with all sorts of variations and has to be cleaned up before use. She goes through many one-liners that help make that easy.

Having old reliables is my key to success. Ever since I learned Perl during the dot com bubble, I knew that I was forever beholden to its powers to transform.

You heard me. Freedom is the word here with Perl.

When I'm coding freely at home on my fun data science project, I rely on it to clean up my data.

In the real world, data is often collected with loads of variations. Unless you are using someone's "clean" dataset, you better learn to clean that data real fast.
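
For a flavor of the kind of cleanup being described, something along these lines (an illustrative sketch, not taken from her post; the file name is made up) trims stray whitespace and drops blank lines in place:

    # Strip leading/trailing whitespace, collapse internal runs of spaces,
    # and drop blank lines; edits messy.csv in place, keeping a .bak copy.
    perl -i.bak -ne 's/^\s+|\s+$//g; s/\s+/ /g; print "$_\n" unless $_ eq ""' messy.csv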


Original Submission

 
  • (Score: 4, Insightful) by The Mighty Buzzard on Thursday September 19 2019, @03:13PM (29 children)

    by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Thursday September 19 2019, @03:13PM (#896105) Homepage Journal

    Said it before and I'll say it again, there is absolutely nothing as good as perl for random text wrangling unless you're dealing with data sets so enormous that you absolutely have to have the fastest running code possible. For most applications you will never make up the extra time spent coding it in any other language.

    --
    My rights don't end where your fear begins.
  • (Score: 3, Interesting) by JoeMerchant on Thursday September 19 2019, @03:27PM (16 children)

    by JoeMerchant (3937) on Thursday September 19 2019, @03:27PM (#896115)

    I live mostly inside my C++ bubble, further subset into the Qt API. In here, QString is pretty damn good at wrangling string issues, and when it's not enough you can always bail out to RegEx (sounds like: retch, for a reason I think.)

    So much is just whatever you are familiar with, Boost, stdlib, whatever... if you know how to use them they've got most of the tools you need pre-coded, and if you find yourself doing the same thing over and over that takes more than 2 lines to accomplish, that sounds like time for a personal library extension...

    --
    🌻🌻 [google.com]
    • (Score: 3, Insightful) by The Mighty Buzzard on Thursday September 19 2019, @03:37PM (5 children)

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Thursday September 19 2019, @03:37PM (#896121) Homepage Journal

      Kind of the point. Using perl you can do a whole hell of a lot in one line. I've used a lot of languages over the years, and I still pick new ones up for fun every year or two, and I've never found anything even close to as versatile and efficient with text as perl. Python usually takes a minimum of three times as many lines to accomplish what perl can do legibly in one; five to ten is more common.
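
      As an illustration of the sort of thing meant (a made-up example, not the poster's): rewriting every US-style date in a file as an ISO date, in place with a backup, is one readable line:

        # Rewrite MM/DD/YYYY dates as YYYY-MM-DD throughout notes.txt,
        # editing in place and keeping a .bak copy.
        perl -i.bak -pe 's{(\d{2})/(\d{2})/(\d{4})}{$3-$1-$2}g' notes.txt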

      --
      My rights don't end where your fear begins.
      • (Score: 2) by JoeMerchant on Thursday September 19 2019, @03:49PM (4 children)

        by JoeMerchant (3937) on Thursday September 19 2019, @03:49PM (#896126)

        Yeah, one-liners are good - I really appreciate having an easy GUI that I can just throw open a scrolling text box in, append text to it all day long with single line commands, HTML format that text if I feel like it, maybe toss on a few checkboxes to toggle boolean control variables (like command line switches, but changeable at runtime...)

        Again, it's all in what you're used to. Today, I'm appreciating the verbose log files that make it relatively easy to spot what went weird when the testers come up with their 1/10,000 behaviors.

        --
        🌻🌻 [google.com]
        • (Score: 3, Interesting) by The Mighty Buzzard on Thursday September 19 2019, @03:57PM (3 children)

          by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Thursday September 19 2019, @03:57PM (#896130) Homepage Journal

          Honestly, I use grep, awk, and sed for most one-liner type stuff. Perl is overkill for the extremely simple stuff; I mostly reach for it when I need something at least slightly more complex. The ability to get way more work done per readable line is just as useful in a script as on a command line though.

          --
          My rights don't end where your fear begins.
          • (Score: 0) by Anonymous Coward on Friday September 20 2019, @05:53AM (2 children)

            by Anonymous Coward on Friday September 20 2019, @05:53AM (#896404)

            I find grep and sed incredibly useful and intuitive, while never quite grokking awk. Don't know why.

            • (Score: 0) by Anonymous Coward on Friday September 20 2019, @08:57AM (1 child)

              by Anonymous Coward on Friday September 20 2019, @08:57AM (#896434)

              I find grep and sed incredibly useful and intuitive, while never quite grokking awk. Don't know why.

              Because awk is for parsing stuff, especially column-oriented documents. If you want the 5th column of something, for example. But if you don't deal with column data, then you probably would never need awk.
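
              For comparison with the Perl angle of the thread, the analogue of that column grab is nearly as short (whitespace-separated input and the file name are assumed here):

                # Print the fifth whitespace-separated field of each line,
                # roughly what awk '{ print $5 }' does.
                perl -lane 'print $F[4]' data.txt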

              • (Score: 0) by Anonymous Coward on Friday September 20 2019, @11:00AM

                by Anonymous Coward on Friday September 20 2019, @11:00AM (#896452)

                Awk is really good at stuff you would normally have to pipe sed and grep together for; you can use one simple awk statement instead. It also has some formatting capabilities, so I like to use it when writing shell functions.

    • (Score: 0) by Anonymous Coward on Thursday September 19 2019, @03:48PM (4 children)

      by Anonymous Coward on Thursday September 19 2019, @03:48PM (#896125)

      Your bubble will burst when touched by Unicode multilanguage data mixed with funny math/geometry/engineering symbols and true emoji.

      • (Score: 2) by The Mighty Buzzard on Thursday September 19 2019, @03:52PM (2 children)

        by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Thursday September 19 2019, @03:52PM (#896128) Homepage Journal

        Perl's kind of shitty at that too unless you know the few simple tricks to make it not shitty at it.

        --
        My rights don't end where your fear begins.
        • (Score: 0) by Anonymous Coward on Friday September 20 2019, @03:20AM (1 child)

          by Anonymous Coward on Friday September 20 2019, @03:20AM (#896367)

          Really? Perl was the first language to manage that properly for me. Yeah, sometimes you need to be explicit in unusual ways, but at least you _can_ without jumping through hoops. I think this gets thrown in the "tricks" category just because it so rarely comes up that you're probably going to need to look it up when it does, cuz otherwise you just pick an encoding and forget about it.
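
          For anyone wondering, the explicit bits usually amount to declaring the encodings up front; a minimal sketch of the usual incantations (details vary with where your text actually comes from):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use utf8;                              # the source file itself is UTF-8
            use open ':std', ':encoding(UTF-8)';   # decode/encode STDIN, STDOUT, STDERR as UTF-8

            # With the pragmas above, regexes and length() count characters, not bytes.
            my $s = "naïve café 🐪";
            print length($s), "\n";                # 12 characters, not 17 bytes
            print "$s\n";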

      • (Score: 2) by JoeMerchant on Thursday September 19 2019, @08:55PM

        by JoeMerchant (3937) on Thursday September 19 2019, @08:55PM (#896252)

        Funny thing, QString handles Unicode, UTF-8, etc. conversions pretty much seamlessly, as do the other modern string classes. Thank God for that, because the last thing I want to fool with is conversion between Unicode and UTF-8.

        --
        🌻🌻 [google.com]
    • (Score: 3, Insightful) by legont on Thursday September 19 2019, @07:09PM (4 children)

      by legont (4179) on Thursday September 19 2019, @07:09PM (#896214)

      that sounds like time for a personal library extension...

      Yeah, but once one finishes building it, one discovers that he has rebuilt Perl.

      --
      "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
      • (Score: 2) by JoeMerchant on Thursday September 19 2019, @08:59PM (3 children)

        by JoeMerchant (3937) on Thursday September 19 2019, @08:59PM (#896254)

        Yeah, but once one finishes building it, one discovers that he has rebuilt Perl.

        If you're that heavy into what Perl does, by all means, use it. For me, it's a bigger PITA to "shell out" to get access to Perl than it is to recode in C++, on the rare occasions it is necessary. If I really loved Perl so much but still needed to be in C++, I'd make a dedicated wrapper for Perl and get full access that way.

        BTW: to anyone considering PyQt for anything bigger than a toy project, my recommendation is: don't. But then, that's pretty much my recommendation for Python all over. Sure, there are some pretty impressive "large" things out there mostly based in Python, like trac, which I have happily used for over 10 years now, but for the most part, unless the coding team is super disciplined, Python degenerates into a ball of snakes much faster than any of the C derivatives I have ever worked with.

        --
        🌻🌻 [google.com]
        • (Score: 2) by legont on Friday September 20 2019, @01:01AM (2 children)

          by legont (4179) on Friday September 20 2019, @01:01AM (#896323)

          My main choice for a long time was C (plain, without ++). For quick and dirty things I'd use AWK, and I'm not talking here about one-liners, but full-blown software of a few hundred or even thousands of lines of code. There was even an AWK compiler that one guy wrote and sold for $99, and it did a very good job.

          At some point I discovered Perl, gave it a try, and it replaced AWK for me, even the compiled version. As time passed, I realized that I had pretty much stopped using C except on rare special occasions and that Perl covered everything for me.

          Management has at different times pushed Java, dotnet, Python and so on, but so far, in the end, Perl could not be replaced. There is another attempt going on right now, and this time they may succeed, but let's see...

          I appreciate your comment about Python, but if you were asked to replace a huge Perl project with something modern that fresh college kids would like, what would you recommend?

          --
          "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
          • (Score: 4, Insightful) by JoeMerchant on Friday September 20 2019, @01:11PM (1 child)

            by JoeMerchant (3937) on Friday September 20 2019, @01:11PM (#896478)

            if you were asked to replace a huge Perl project with something modern that fresh college kids would like, what would you recommend?

            Perl.

            I worked for almost a decade converting fresh college kids' code (Matlab, Python, and strangely: a fair bit of Fortran) to C++ so that their broken toys could be sold to real customers.

            --
            🌻🌻 [google.com]
            • (Score: 2) by legont on Friday September 20 2019, @05:54PM

              by legont (4179) on Friday September 20 2019, @05:54PM (#896583)

              Yes, my thoughts exactly.

              --
              "Wealth is the relentless enemy of understanding" - John Kenneth Galbraith.
  • (Score: 2) by Thexalon on Thursday September 19 2019, @04:18PM (8 children)

    by Thexalon (636) on Thursday September 19 2019, @04:18PM (#896135)

    Perl's great for this ... unless there's a simple known sed/awk approach to the problem, because those are both a bit faster and simpler.

    --
    The only thing that stops a bad guy with a compiler is a good guy with a compiler.
    • (Score: 2) by The Mighty Buzzard on Thursday September 19 2019, @05:00PM (7 children)

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Thursday September 19 2019, @05:00PM (#896154) Homepage Journal

      Simpler, yes; faster, maybe. After three to five pipes it starts making more sense to break out the perl.
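
      To make that concrete (file name and pattern invented for the example), the filter-then-extract-then-deduplicate job that tends to grow into a grep | cut | sort | uniq chain fits in one invocation:

        # Collect the unique second fields of lines mentioning ERROR --
        # the kind of job that otherwise becomes a three-or-four-stage pipeline.
        perl -lane '$u{$F[1]}++ if /ERROR/; END { print for sort keys %u }' app.log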

      --
      My rights don't end where your fear begins.
      • (Score: 2) by Thexalon on Thursday September 19 2019, @05:06PM

        by Thexalon (636) on Thursday September 19 2019, @05:06PM (#896158)

        Yeah, I'm talking about the near-1-liners that are sometimes available. I agree that if you start getting into complexity, then yes, bust out the Perl.

        --
        The only thing that stops a bad guy with a compiler is a good guy with a compiler.
      • (Score: 0) by Anonymous Coward on Friday September 20 2019, @12:22AM (5 children)

        by Anonymous Coward on Friday September 20 2019, @12:22AM (#896303)

        Though there's an extra advantage with shell pipelines: free SMP support. Each command is (obviously) a separate process, so each stage of your pipeline can get its own core without writing multi-threaded code.

        • (Score: 2) by The Mighty Buzzard on Friday September 20 2019, @01:32AM (4 children)

          by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Friday September 20 2019, @01:32AM (#896335) Homepage Journal

          Not a huge help when they're run sequentially, as piped commands are.

          --
          My rights don't end where your fear begins.
          • (Score: 1, Interesting) by Anonymous Coward on Friday September 20 2019, @04:06AM (3 children)

            by Anonymous Coward on Friday September 20 2019, @04:06AM (#896385)

            In MS-DOS they ran sequentially; in Unix/Linux the programs run in parallel and consume each other's output as soon as it becomes available. Run 'find | cat' on a large directory tree. The output in the terminal starts appearing immediately, which can only happen if cat runs in parallel with find. Or run 'find | rev' if you want to see the second command make a visible change.
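
            The same thing is easy to see with two perl processes in a pipe (a throwaway illustration): the right-hand side prints each line the instant the left-hand side emits it, which couldn't happen if the stages ran one after the other.

              # Left side: one line per second, with autoflush on so output isn't
              # held back by block buffering. Right side: echo each line as it arrives.
              perl -e '$|=1; for (1..5) { print "tick $_\n"; sleep 1 }' | perl -ne 'print "got: $_"'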

            • (Score: 2) by The Mighty Buzzard on Friday September 20 2019, @07:24PM (2 children)

              by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Friday September 20 2019, @07:24PM (#896621) Homepage Journal

              In Linux/*BSD they effectively run sequentially in most cases as well. There's a pretty small subset of commands that output useful information as they run.

              --
              My rights don't end where your fear begins.
              • (Score: 0) by Anonymous Coward on Friday September 20 2019, @08:41PM (1 child)

                by Anonymous Coward on Friday September 20 2019, @08:41PM (#896639)

                Don't they run in reverse sequence for all shell redirection? If you run 'echo "test" | less', less starts first, opening the fd that echo then prints out to.

  • (Score: 0) by Anonymous Coward on Thursday September 19 2019, @09:30PM (2 children)

    by Anonymous Coward on Thursday September 19 2019, @09:30PM (#896265)

    We use Perl for data sets in the TB range at work and it's fine.

    • (Score: 2) by The Mighty Buzzard on Friday September 20 2019, @01:35AM (1 child)

      by The Mighty Buzzard (18) Subscriber Badge <themightybuzzard@proton.me> on Friday September 20 2019, @01:35AM (#896337) Homepage Journal

      Depending on the use case, it can be, sure. If you're trying to serve up thousands of pages a second off that dataset or some such, I'd go with something a little closer to the metal, so you can use an optimized function that does only what you need and does it in as few cycles as possible.

      --
      My rights don't end where your fear begins.
      • (Score: 0) by Anonymous Coward on Friday September 20 2019, @08:03PM

        by Anonymous Coward on Friday September 20 2019, @08:03PM (#896634)

        We're using Vertica (a column-oriented SQL database) for the high performance queries. But for loading the data into Vertica, we use Perl.