
Monitoring Process of Cases[] on a Very Large Body of Information

I'm working with a very large body of text (~290 MB of plain text in one file). After importing it into Mathematica 8, I'm breaking it down into lowercase words, etc., so I can begin textual analysis.

The problem is that these operations take a long time. Is there a way to monitor their progress from within Mathematica? For operations that loop over a variable, I've used ProgressIndicator etc., but this is different. Searching the documentation and StackOverflow has not turned up anything similar.

In the following, I would like to monitor the progress of the Cases[ ] command:

input=Import["/users/USER/alltext.txt"];
wordList=Cases[StringSplit[ToLowerCase[input],Except[WordCharacter]],Except[""]];
canadian_scholar asked Oct 18 '11 02:10


3 Answers

Something like StringCases[ToLowerCase[input], WordCharacter..] seems to be a little faster. And I would probably use DeleteCases[expr, ""] instead of Cases[expr, Except[""]].
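As a quick sanity check before switching approaches on the full file, the two formulations can be compared on a small string. This is only a sketch: the names sample, viaSplit and viaCases are made up for illustration, and any speed difference on the real 290 MB input would need to be measured separately (for example with AbsoluteTiming).

```mathematica
(* Hypothetical sample text; not taken from the original file *)
sample = "The quick, brown fox -- jumps!";

(* Approach from the question: split on non-word characters, then drop the empty strings *)
viaSplit = DeleteCases[
   StringSplit[ToLowerCase[sample], Except[WordCharacter]], ""];

(* Suggested approach: pull out runs of word characters directly *)
viaCases = StringCases[ToLowerCase[sample], WordCharacter ..];

(* Compare the two word lists *)
viaSplit === viaCases
```

If the comparison holds for text that looks like yours, the StringCases form avoids creating and then discarding the empty strings.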

Joshua Martell answered Oct 13 '22 04:10


It is possible to view the progress of the StringSplit and Cases operations by injecting "counter" operations into the patterns being matched. The following code temporarily shows two progress bars: the first showing the number of characters processed by StringSplit and the second showing the number of words processed by Cases:

input = ExampleData[{"Text", "PrideAndPrejudice"}];

wordList =
  Module[{charCount = 0, wordCount = 0, allWords}
  , PrintTemporary[
      Row[
        { "Characters: "
        , ProgressIndicator[Dynamic[charCount], {0, StringLength@input}]
        }]]

  ; allWords = StringSplit[
        ToLowerCase[input]
      , (_ /; (++charCount; False)) | Except[WordCharacter]
      ]

  ; PrintTemporary[
      Row[
        { "Words:      "
        , ProgressIndicator[Dynamic[wordCount], {0, Length@allWords}]
        }]]

  ; Cases[allWords, (_ /; (++wordCount; False)) | Except[""]]

  ]

The key to the technique is that the patterns used in both cases match against the wildcard _. However, that wildcard is guarded by a condition that always fails -- but not until it has incremented a counter as a side effect. The "real" match condition is then processed as an alternative.
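The same trick can be tried in isolation on a tiny list, without any progress bar. This is a minimal sketch with made-up names (items, seen, kept); it only illustrates that the always-failing guard runs as a side effect while Except[""] decides which elements are kept.

```mathematica
(* Hypothetical test data *)
items = {"a", "", "b", "", "c"};
seen = 0;

(* The first alternative bumps the counter and then fails,
   so matching falls through to the real pattern Except[""] *)
kept = Cases[items, (_ /; (++seen; False)) | Except[""]];

(* kept holds the non-empty strings; seen counts how often the guard was evaluated *)
{kept, seen}
```

Because the guard is evaluated during pattern matching, a Dynamic display of the counter (as in the progress bars above) updates while Cases is still running.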

WReach answered Oct 13 '22 03:10


It depends a little on what your text looks like, but you could try splitting the text into chunks and iterating over those. You could then track the iterator with Monitor to see the progress. For example, if your text consists of lines terminated by newlines, you could do something like this:

Module[{list, t = 0},
 (* Read the file as a list of lines *)
 list = ReadList["/users/USER/alltext.txt", String];
 (* Show a progress bar that tracks the line index t while the Table runs *)
 Monitor[wordList = 
   Flatten@Table[
     StringCases[ToLowerCase[list[[t]]], WordCharacter ..], 
      {t, Length[list]}], 
  Labeled[ProgressIndicator[t/Length[list]], N@t/Length[list], Right]];
 Print["Ready"]] 

On a file of about 3 MB this took only marginally more time than Joshua's suggestion.

Heike answered Oct 13 '22 02:10